
Stop recreating the hashmap all the time. #363

Merged 2 commits into main from fix_small_tensors_performance on Sep 18, 2023
Conversation

@Narsil (Collaborator) commented Sep 18, 2023

What does this PR do?

This is a different approach from #362, but it should fix the same issue.

This doesn't change any external Python API, and it makes all existing code faster (simply by using the existing structures more intelligently).
This is better IMO, as it speeds up every potential use of the safe_open method, whereas #362 introduces a new layer that has to be used explicitly in order to benefit (so only users of safetensors.torch.load_file would actually see the difference).

The crux of the problem was Metadata::tensors, a method that builds a full HashMap<String, TensorInfo> just to give callers access to the tensors. The internal structure used to be exactly that map, so handing it to callers was essentially free. Since then, however, it was refactored into a HashMap<String, usize> plus a Vec<TensorInfo>, so every call had to rebuild the map from scratch.
The main reason for that refactor is that the Vec is intrinsically ordered: we keep it sorted by data_offsets, so checking for malformed files (with overlapping data_offsets) only requires comparing adjacent entries.
It also helps during serialization, where the information is already sorted, so we can write it out in order into the JSON header.
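
For intuition, here is a minimal Rust sketch of the access patterns described above. The struct layout and the method names (tensors_map, info, validate_offsets) are simplified assumptions for illustration, not the actual safetensors internals:

```rust
use std::collections::HashMap;

// Simplified stand-in for the real TensorInfo (dtype and shape omitted).
struct TensorInfo {
    data_offsets: (usize, usize),
}

struct Metadata {
    // Maps a tensor name to its index in `tensors`.
    index_map: HashMap<String, usize>,
    // Kept sorted by data_offsets.
    tensors: Vec<TensorInfo>,
}

impl Metadata {
    // The costly pattern: rebuilding (and cloning keys for) a whole map
    // on every call, which dominates load time for files with many
    // small tensors.
    fn tensors_map(&self) -> HashMap<String, &TensorInfo> {
        self.index_map
            .iter()
            .map(|(name, &i)| (name.clone(), &self.tensors[i]))
            .collect()
    }

    // The cheap pattern: answer per-tensor lookups directly from the
    // existing structures, with no intermediate HashMap.
    fn info(&self, name: &str) -> Option<&TensorInfo> {
        self.index_map.get(name).map(|&i| &self.tensors[i])
    }

    // Because `tensors` stays sorted by data_offsets, detecting
    // overlapping (malformed) ranges is a single linear pass over
    // adjacent pairs.
    fn validate_offsets(&self) -> bool {
        self.tensors
            .windows(2)
            .all(|w| w[0].data_offsets.1 <= w[1].data_offsets.0)
    }
}

fn main() {
    let meta = Metadata {
        index_map: HashMap::from([("a".into(), 0), ("b".into(), 1)]),
        tensors: vec![
            TensorInfo { data_offsets: (0, 16) },
            TensorInfo { data_offsets: (16, 32) },
        ],
    };
    assert!(meta.validate_offsets());
    assert!(meta.info("a").is_some());
    assert_eq!(meta.tensors_map().len(), 2); // full rebuild on each call
}
```

Under this reading, serving lookups straight from the index map and the sorted Vec, instead of reconstructing a map on hot paths, is what lets every consumer of safe_open benefit without any Python API change.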

Fixes #361
Potentially supersedes #362

@Narsil force-pushed the fix_small_tensors_performance branch from 11f6379 to bc466b5 on September 18, 2023 at 13:13
Narsil and others added 2 commits September 18, 2023 15:14
Fixes #361
Potentially supersedes #362

Co-Authored-By: Batuhan Taskaya <batuhan@python.org>
@Narsil merged commit 6eb419f into main on Sep 18, 2023
10 checks passed
@Narsil deleted the fix_small_tensors_performance branch on September 18, 2023 at 14:45
@FurkanGozukara

Do you have an account on Twitter? I would like to follow you.

Good job!

@Narsil (Collaborator, Author) commented Sep 18, 2023

https://twitter.com/narsilou
I'm not very active though :)

@FurkanGozukara

> https://twitter.com/narsilou I'm not very active though :)

Followed, thanks!

RyanJDick added a commit to invoke-ai/InvokeAI that referenced this pull request Oct 11, 2023
## What type of PR is this? (check all applicable)

- [ ] Refactor
- [ ] Feature
- [ ] Bug Fix
- [x] Optimization
- [ ] Documentation Update
- [ ] Community Node Submission

## Description

@Millu pointed out this safetensors PR a few weeks ago, which claimed to offer a performance benefit: huggingface/safetensors#362. It was superseded by huggingface/safetensors#363 and included in the latest [safetensors 0.4.0 release](https://github.com/huggingface/safetensors/releases/tag/v0.4.0).

Here are the results from my local performance comparison:
```
Before(0.3.1) / After(0.4.0)

sdxl:main:tokenizer from disk to cpu in                              0.46s / 0.46s
sdxl:main:text_encoder from disk to cpu in                           2.12s / 2.32s
embroidered_style_v1_sdxl.safetensors:sdxl:lora' from disk to cpu in 0.67s / 0.36s
VoxelXL_v1.safetensors:sdxl:lora' from disk to cpu in                1.64s / 0.60s
ryan_db_sdxl_epoch640.safetensors:sdxl:lora' from disk to cpu in     2.46s / 1.40s
sdxl:main:tokenizer_2 from disk to cpu in                            0.37s / 0.39s
sdxl:main:text_encoder_2 from disk to cpu in                         3.78s / 4.70s
sdxl:main:unet from disk to cpu in                                   4.66s / 3.08s
sdxl:main:scheduler from disk to cpu in                              0.34s / 0.33s
sdxl:main:vae from disk to cpu in                                    0.66s / 0.51s

TOTAL GRAPH EXECUTION TIME:                                        56.489s / 53.416s
```

The benefit was marginal on my system (maybe even within measurement
error), but I figured we might as well pull it.
Successfully merging this pull request may close these issues:

Loading shallow/augmenting weights (such as LoRAs) with a lot of layers are 10X slower than pure torch