Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

update Bark FA2 docs #27400

Merged
merged 5 commits into from
Nov 10, 2023
Merged

Conversation

ylacombe
Copy link
Contributor

@ylacombe ylacombe commented Nov 9, 2023

What does this PR do?

Following @ArthurZucker's review in #27634, I've added a section on FA2 in Bark readme and a mention of Bark FA2 support in a section on FA2!

Note that this comment #27364 (comment) is already addressed since self.dropout is already a float!

Before submitting

  • [w] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).

cc @amyeroberts and @ArthurZucker

@HuggingFaceDocBuilderDev
Copy link

HuggingFaceDocBuilderDev commented Nov 9, 2023

The documentation is not available anymore as the PR was closed or merged.

Copy link
Collaborator

@amyeroberts amyeroberts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding!

Overall looks good but the comparison values should be updated to make this more useful for readers.

Comment on lines 93 to 95
Flash Attention 2 is also consistently faster than Better Transformer, and its performance improves even more as batch sizes increase.

To put this into perspective, you can generate 17 times more text and still be 2s faster than the unoptimized version. At batch size 8, Flash Attention 2 is also 10% faster than Better Transformer, and at batch size 16, 25%.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't really a comparison: to do this the specific tests, hardware and runtime numbers should be shown to be useful. "17 times more text" is underfined - do you mean tokens?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right oc, I'll be more specific

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's difficult to be specific without taking too much time since the tokenizer by default pads the input to 256 tokens, but I'll be more specific in my terms whatsoever!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated! what do you think of this ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better - but still needs some more information and structure.

For example the graphs we see here for mistral. At the moment you're writing a sentence which is giving one data point on the graph without the context of the test being run, and so can't be used to inform any decisions.

It's difficult to be specific without taking too much time since the tokenizer by default pads the input to 256 tokens, but I'll be more specific in my terms whatsoever!

Could you explain this further? What do you mean by too much time? As in to get the numbers? I don't understand the relation to the max length either. Couldn't you run on a dummy model of small context length and benchmark on that? The generations don't have to be good (they can be nonsense) this is just about providing experimental values.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I basically wanted to avoid making the readme more packed than it is already but a graph will do the trick!

Could you explain this further? What do you mean by too much time? As in to get the numbers? I don't understand the relation to the max length either. Couldn't you run on a dummy model of small context length and benchmark on that? The generations don't have to be good (they can be nonsense) this is just about providing experimental values.

I just meant that I've already run the benchmarks and that counting tokens (i.e getting the average context length) wasn't straightforward due to how the the tokenizer works

Copy link
Contributor

@sanchit-gandhi sanchit-gandhi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few minor suggestions: would re-order the structure a bit, and maybe explain a bit more where the 17x number comes from.

Think for a high-level comparison these numbers are enough; the docs should serve as an indication of the performance gain we should expect to get, but they're not academic benchmarks, so don't have to be fully water tight.

docs/source/en/model_doc/bark.md Outdated Show resolved Hide resolved
docs/source/en/model_doc/bark.md Outdated Show resolved Hide resolved
docs/source/en/model_doc/bark.md Outdated Show resolved Hide resolved
docs/source/en/model_doc/bark.md Outdated Show resolved Hide resolved
docs/source/en/model_doc/bark.md Outdated Show resolved Hide resolved
docs/source/en/model_doc/bark.md Outdated Show resolved Hide resolved
<img src="https://huggingface.co/datasets/ylacombe/benchmark-comparison/resolve/main/Bark%20Optimization%20Benchmark.png">
</div>

To put this into perspective, on an NVIDIA A100, you can get 17 times the [throughput](https://huggingface.co/blog/optimizing-bark#throughput) and still be 2 seconds faster than the unoptimized, non-batch version.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really follow where this 17x number comes from? Are we comparing FA2 batched vs un-optimised non-batched?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly, I'll make it clearer

ylacombe and others added 2 commits November 10, 2023 10:52
Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>
@ylacombe ylacombe requested a review from amyeroberts November 10, 2023 12:50
Copy link
Collaborator

@amyeroberts amyeroberts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The graph is great 📈 - thanks for adding!

@ylacombe
Copy link
Contributor Author

Thanks! Merging!

@ylacombe ylacombe merged commit 9dd58c5 into huggingface:main Nov 10, 2023
3 checks passed
EduardoPach pushed a commit to EduardoPach/transformers that referenced this pull request Nov 19, 2023
* update Bark FA2 docs

* update benchmark section

* Update bark.md

* Apply suggestions from code review

Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>

* rephrase

---------

Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants