
Use ModelOutput instead of tuples #283

Merged (31 commits into explosion:master on Sep 9, 2021)

Conversation

@adrianeboyd (Contributor) commented Jul 22, 2021

  • Save model output as ModelOutput instead of a list of tensors in TransformerData.model_output and FullTransformerBatch.model_output.

    • For backwards compatibility with transformers v3, set return_dict = True in the transformer config.
  • TransformerData.tensors and FullTransformerBatch.tensors return ModelOutput.to_tuple() (see the sketch after this list).

  • Store any additional model output as ModelOutput in TransformerData.model_output.

    • Save all torch.Tensor and tuple(torch.Tensor) values in TransformerData.model_output for cases where tensor.shape[0] is the batch size so that it's possible to slice the output for individual docs.
      • Includes: pooler_output, hidden_states, attentions, and cross_attentions
  • Re-enable tests for gpt2 and xlnet in the CI.

  • Following Fixing Transformer IO (attempt 2) #285, include some minor modifications and bug fixes for HFShim and HFObjects

    • Rename the temporary init-only configs in HFObjects and don't serialize them in HFShim once the model is initialized

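A rough sketch of what this looks like from the pipeline side after the change (the model name is only an example; which keys appear in model_output depends on the model and its config):

import spacy

nlp = spacy.blank("en")
nlp.add_pipe("transformer", config={"model": {"name": "distilbert-base-uncased"}})
nlp.initialize()

trf_data = nlp("An example sentence.")._.trf_data
# model_output is a transformers ModelOutput; .tensors is derived from it via to_tuple()
assert "last_hidden_state" in trf_data.model_output
tensors = trf_data.tensors
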
* Use `BaseModelOutput` instead of a list of tensors for the transformer
model output and in `FullTransformerBatch.tensors`.

  * For backwards compatibility with transformers v3 set
    `return_dict = True` in the transformer config.

* Rename `TransformerData.attention` to `TransformerData.attentions` for
consistency with `BaseModelOutput`.


For compatibility with transformers v3.x
@adrianeboyd marked this pull request as draft July 22, 2021 20:03
* Use `albert-base-v2` to reduce model downloads in tests
* Test tagger training with `output_attentions` enabled
* Check for `pooling_output` (unnamed) in `trf_data`
* Only store `last_hidden_state` in `TransformerData.tensors`

* Move all `torch.Tensor` and `tuple(torch.Tensor)` values into
  `TransformerData.model_output` for cases where `tensor.shape[-2]` is
  the sequence length so that it's possible to slice the output for
  individual docs (see the sketch after this list)
  * Includes: `pooler_output`, `hidden_states`, `attentions`, and
    `cross_attentions`
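
A minimal sketch of the kind of per-doc slicing this enables (hypothetical shapes and lengths, not the library's actual splitting code):

import torch

hidden = torch.zeros(2, 8, 4)   # hypothetical batch: 2 docs, 8 wordpieces, width 4
doc_lengths = [5, 8]            # hypothetical number of wordpieces per doc

# because the sequence length is at shape[-2], each doc's rows can be sliced out
per_doc = [hidden[i, :length] for i, length in enumerate(doc_lengths)]
assert [t.shape for t in per_doc] == [(5, 4), (8, 4)]
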
@adrianeboyd marked this pull request as ready for review July 26, 2021 14:04
@adrianeboyd (Contributor, PR author)

@KennethEnevoldsen If you have time, could I ask you to take a look at this PR? I start to go cross-eyed after a bit with all the model formats/outputs and the doc splitting, and since the test suite is a bit sparse, it's possible I've missed something that would cause problems for workflows like yours.

The main API changes (also described above) are that FullTransformerBatch.tensors is a ModelOutput instead of a tuple, and that TransformerData.model_output contains any additional model output (as much as can be split by batch size), with TransformerData.tensors only containing last_hidden_state. That way there's just model_output, with no special cases in the code for attentions specifically. It's possible that it might be better or more generic to use the extra annotation setter to set these values, but I think it would require duplicating a lot of the doc splitting code.

FullTransformerBatch.tensors could possibly be aliased or renamed (and maybe TransformerData.tensors, too), perhaps with last_hidden_state stored within model_output rather than separately. I'm also not sure whether it's confusing that ModelOutput is used in TransformerData to store non-torch tensors.

@KennethEnevoldsen (Contributor)

I will take a look at it. It will probably be the start of next week, though; I hope that is fine.

@KennethEnevoldsen (Contributor) left a comment

I think most of this will work just fine, maybe with the exception of the zero tensor in an empty FullTransformerBatch.

Moving attention into model_output seems reasonable and should also be easier to maintain, though it might be slightly too nested. Here I think the extra annotation setters you mentioned are the better solution in terms of usability.

As you mentioned, it might cause confusion that FullTransformerBatch.tensors and TransformerData.tensors aren't the same data type (and aren't a tensor in one case). I think the ideal situation here would be to expose last_hidden_state and model_output (maybe even other features of interest, e.g. attention) and keep tensors for backwards compatibility.

@adrianeboyd (Contributor, PR author)

Thanks for the feedback! Thinking about it again, maybe both data classes should use ModelOutput internally as .model_output, with .tensors becoming a property that returns .model_output.to_tuple(). It could cause problems for people who are trying to write to tensors (which is rare, hopefully?), and we'd have to be sure that you can still extract last_hidden_state consistently, but that's something we should be able to manage internally.
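
Roughly the shape of that idea (a sketch, not the actual TransformerData definition; the ModelOutput import path varies across transformers versions):

from dataclasses import dataclass, field
from typing import Tuple

from transformers.file_utils import ModelOutput

@dataclass
class TransformerDataSketch:
    # the ModelOutput is the single source of truth for the model output
    model_output: ModelOutput = field(default_factory=ModelOutput)

    @property
    def tensors(self) -> Tuple:
        # derived on access, so it can never diverge from model_output
        return self.model_output.to_tuple()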

@KennethEnevoldsen (Contributor)

I think that is a reasonable approach, and I think assignment to .tensors is rare. Alternatively, you could set it once and allow it to be overwritten, though I suspect this may lead to .model_output and .tensors holding different tensors.

In `TransformerData` and `FullTransformerBatch` use `ModelOutput` rather
than a list of tensors as the primary model output data storage. In both
classes, `.tensors` is converted to a property that returns
`model_output.to_tuple()`.
@adrianeboyd (Contributor, PR author)

Yeah, I think it would be really bad if they diverge, so let's try the tuple property for .tensors.

As in the documented API, return `List` for `TransformerData.tensors`
and `FullTransformerBatch.tensors`.
@adrianeboyd (Contributor, PR author)

This would be v1.1 anyway, so maybe it would be better to just go ahead and have .tensors be Tuple instead of List, to make it clear that it shouldn't be edited and to stay closer to the transformers behavior.

@adrianeboyd requested a review from honnibal August 10, 2021 11:20
@honnibal (Member)

I think this looks good. The only thing I'd want to make sure we think about is that the full ModelOutput might include a lot of data, and if the data is on GPU the collection of documents can easily end up consuming all the memory.

The transformers library should let people set what they want stored into the ModelOutput, right? Do we expose that setting conveniently, including exposing it when people have loaded a trained pipeline?
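
For reference, on the plain transformers side the extra outputs are opt-in flags on the model config, roughly along these lines (which flags a given architecture honors varies by model):

from transformers import AutoConfig, AutoModel, AutoTokenizer

name = "distilbert-base-uncased"
config = AutoConfig.from_pretrained(name, output_attentions=True)
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, config=config)

inputs = tokenizer("An example sentence.", return_tensors="pt")
outputs = model(**inputs)                # a ModelOutput
assert outputs.attentions is not None   # only present because output_attentions=True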

@adrianeboyd (Contributor, PR author)

None of the additional model output is enabled by default in transformers as far as I've seen, although I suppose that could change and it's kind of out of our control.

As it is now, the settings are fixed once you've loaded the model. I think this is mostly okay, and I'm not sure I'd want to go to a lot of trouble to let you change it for each batch.

I thought you would be able to do this, but you can't override a value that isn't in the config by default:

nlp = spacy.load("model", config={"components": {"transformer": {"model": {"transformer_config": {"output_attentions": True}}}}})

Existing config section:

[components.transformer.model.transformer_config]

The permitted values obviously vary by model, but maybe we need to figure out how to add the permitted ones as false in the saved config, though I'd hope there's an easier solution I'm not immediately seeing.

@adrianeboyd (Contributor, PR author) commented Sep 6, 2021

With thinc v8.0.9, the option above of modifying transformer_config no longer crashes, but it still doesn't work due to the modified serialization, because huggingface_from_pretrained is no longer called with the config settings when the model is loaded.

You can change the settings on-the-fly like this:

nlp = spacy.blank("en")
nlp.add_pipe("transformer", config={"model": {"name": "distilbert-base-uncased", "transformer_config": {"output_attentions": True}}})
nlp.initialize()
assert "attentions" in nlp("test")._.trf_data.model_output

nlp.get_pipe("transformer").model.transformer.config.output_attentions = False
assert "attentions" not in nlp("test")._.trf_data.model_output

At this point, parts of the transformer config are in three unsynced places:

# does not include the `return_dict` addition, is only used on initialization
nlp.config["components"]["transformer"]["model"]["transformer_config"]

# includes all the settings, editing on-the-fly works and these are the settings that matter for serialization
nlp.get_pipe("transformer").model.transformer.config

# the internal saved version of the original config.cfg config + `return_dict`,
# but it's not synced (like all model/cfg settings) and also not used on 
# deserialization (unlike other model/cfg settings)
nlp.get_pipe("transformer").model.transformer_config

This is at least one too many, but I have to look at how to avoid saving the third one. Possibly the first one should be moved into [initialize], although that would break what a lot of people are used to. You can also easily override the second one with the third one on deserialization with AutoConfig, which is similar to what happens on initialization with huggingface_from_pretrained, but this has the effect that your config edits as shown above aren't restored on deserialization. (This is actually kind of what transformers itself does, at least for the tokenizer: settings can be modified on the fly, but they're not saved because the saved config is stored somewhere else in the object. I don't like this as a design choice, though.)

@adrianeboyd (Contributor, PR author) commented Sep 6, 2021

Ah, in terms of f2bd2b6, the equivalent edit on deserialization (rather than overriding forward) would be:

config = AutoConfig.from_pretrained(config_file, **trf_config)

I need to look into the details a bit more...

And as a note, I suspect it will be necessary to re-add return_dict on deserialization for v3, since it doesn't seem to be included in the saved config, but I haven't tested it so I'm not entirely sure yet.

Only use `HFObjects.tokenizer/transformer_config` on init and use the
internal configs from `HFObjects.tokenizer/transformer` during
serialization.

To make this clearer, rename the init-only config dict fields in
`HFObjects` to `init_tokenizer_config` and `init_transformer_config`.
This is necessary for transformers==3.4.0, at least.
@adrianeboyd (Contributor, PR author)

I tried to split the HFShim edits into a separate PR, but the merges were complicated enough due to the output_attentions changes that I decided to just go ahead and keep it here.

(I swear this will work someday soon.)

@adrianeboyd (Contributor, PR author)

So in theory I would prefer to move the transformer model settings into [initialize], but I think that's going to be too difficult a change to make at this point. As it stands, it's not that different from other components that serialize some internal settings as model/cfg in an unsynced way.

To make it clearer that this is entirely internals, prefix init configs
with underscore: `HFObjects._init_tokenizer/transformer_config`
@adrianeboyd (Contributor, PR author)

I think it'd be nicer to move the init configs out of HFObjects entirely, but given the number of layers involved here, it doesn't seem worth the effort at this point if everything works and it's clear (enough) to the user where the settings can be modified.

To handle the case where the pipeline is serialized before the
transformer component is initialized, save the init configs if the model
is not initialized and restore the uninitialized `HFObjects` status.
@adrianeboyd merged commit 9020c79 into explosion:master Sep 9, 2021
@adrianeboyd (Contributor, PR author)

As a note: this wasn't actually working correctly when merged and the culprit looks like it's 4ee6e5a. More to come...

Labels: enhancement (New feature or request), v1.1 (Related to v1.1)