Solving issue of different ckpt files with same hash #2459
Conversation
@raefu had a nice and consistently fast solution to this: hashing the zip directory section at the end of the file, so it's a hash of the attributes and CRCs of all the contents. |
@dfaker cool, but I don't know how to do it, so someone will need to help with that solution. |
Silly question - would this break existing hashes stored in images? If so, then I'd definitely want an option to enable this or not. Or use a "hashv2" param in infotext or something. |
@d8ahazard yes, the new hash will be different from the old one. There is an option to enable this; the default is disabled. Your idea to add something to the info text to differentiate is good, I think we should discuss it and find a good way to indicate which hash method was used. Some options:
|
it's a small section at the end of the file, starting with the signature 0x02014b50, so taking the last MB of the .pt should capture it. |
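For reference, a minimal sketch of that suggestion (illustrative only, not any actual implementation from this PR): scan the last MB of the file for the central-directory signature and hash from there to the end. Function and variable names are just placeholders, and the naive `find()` assumes the signature doesn't also occur in stray payload bytes inside the window.

```python
import hashlib
import os

CD_SIG = b"\x50\x4b\x01\x02"  # central directory file header signature 0x02014b50, little-endian on disk

def hash_zip_directory_tail(path, window=1024 * 1024):
    """Hash everything from the first central-directory entry found in the last `window` bytes to end of file."""
    with open(path, "rb") as fh:
        fh.seek(0, os.SEEK_END)
        size = fh.tell()
        fh.seek(max(0, size - window), os.SEEK_SET)
        tail = fh.read()
    start = tail.find(CD_SIG)
    if start == -1:
        raise ValueError("central directory signature not found in the last MB; not a zip-based checkpoint?")
    return hashlib.sha256(tail[start:]).hexdigest()
```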
the biggest problem is what we do with all current model hashes |
I like the
Just riffing off the top of my head here and haven't fully thought this through, but if the original hashing method is kept alongside this new method (particularly if the new method designates itself as a v2/etc in the hash in some identifiable way), then presumably it would be possible to look up the hash either in the 'old' (current) v1 way or the new v2 way. There will still be edge cases where the current v1 hash clashes for distinct models obviously, but in those cases perhaps it could just show a list of the models that match, and potentially offer to upgrade the embedded hash to the v2 hash. (I'm not actually familiar with the workflow around how the hashes are used, so if the above doesn't match the reality of how they're used, adjust/ignore as appropriate) eg.
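The example that originally followed isn't shown here; purely as an illustration of that fallback idea (all names, the "v2:" prefix, and the data shape are hypothetical):

```python
def lookup_model(stored_hash, models):
    """Illustrative only: try the (hypothetical) v2 hash first, then fall back to the old v1 hash.

    `models` is assumed to map a model name to {"hash_v1": ..., "hash_v2": ...},
    and a v2 hash is assumed to mark itself with a "v2:" prefix.
    """
    if stored_hash.startswith("v2:"):
        return [name for name, h in models.items() if h["hash_v2"] == stored_hash]
    # Old v1 hashes can collide, so return every candidate and let the UI show the list
    return [name for name, h in models.items() if h["hash_v1"] == stored_hash]
```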
|
Also, just as a 'prior art' reference, this (hashing the entire model file) seems to be how InvokeAI handles it:

```shell
# ~/dev/stable-diffusion/InvokeAI/models/ldm/stable-diffusion-v1
⇒ ls
model.ckpt model.sha256

⇒ cat model.sha256
fe4efff1e174c627256e44ec2991ba279b3816e364b49f9be2abc0b3ff3f8556%

⇒ time sha256sum --binary model.ckpt
fe4efff1e174c627256e44ec2991ba279b3816e364b49f9be2abc0b3ff3f8556 *model.ckpt
sha256sum --binary model.ckpt  18.98s user 0.60s system 99% cpu 19.728 total
```
|
@0xdevalias thank you, I was thinking the same, I just didn't have time to code it. I will take a look at that InvokeAI code to see if it helps. |
So it sounds like you could probably open the `.ckpt` as a zip and read the central directory entries directly. Some quick Google/StackOverflow results with some example code:
Though this might be a little 'low-level', and it may just be worth seeing if it's possible to use an existing Python zip lib such as: |
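A minimal sketch of that zipfile-based route mentioned above (assumptions: the `.ckpt` really is a plain zip, and hashing the per-entry name/CRC/size attributes already stored in the central directory is enough to distinguish models):

```python
import hashlib
import zipfile

def hash_ckpt_central_directory(path):
    """Hash the per-entry attributes (name, CRC32, size) that the zip central directory already stores."""
    h = hashlib.sha256()
    with zipfile.ZipFile(path) as zf:
        for info in sorted(zf.infolist(), key=lambda i: i.filename):  # sort for a deterministic digest
            h.update(f"{info.filename}:{info.CRC:08x}:{info.file_size}".encode("utf-8"))
    return h.hexdigest()
```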
So my brain got curious, and decided to dive into writing a little PoC script for this. This script will efficiently read the Zip Central Directory from the end of the `*.ckpt` weights file and calculate its sha256 hash. Running it looks like this:

```shell
⇒ ./quick-zip-sha256hash.py
>> *.ckpt file looks good!
>> Calculating sha256 hash of the Zip Central Directory from the *.ckpt weights file
>> sha256(ckpt_cd) = 685bf114177d8ed310eead5838d4ca5aa6e396a64ab978ca91a0dbfcb6247f02 (0.00s)
```

The code also writes the hash out to a `model.sha256-cd` file:

```shell
⇒ cat model.sha256-cd
685bf114177d8ed310eead5838d4ca5aa6e396a64ab978ca91a0dbfcb6247f02%
```

Note that … |
@0xdevalias Sorry, your code didn't work for other models; I got an error for the model |
I don't have that particular model, but you can see the code that raises that error here. Essentially it seeks to the file offset that the EOCD told it should be the start of the CD record, then attempts to read in the length of the record's bytes. It then checks the first 4 bytes to see if they match the 'magic number' that defines the CD record start, and if not, throws that error:

```python
# Seek to where we expect the Central Directory (CD) record to start and read it in
fh.seek(ckpt_eocd_cd_offset, os.SEEK_SET)
ckpt_cd = fh.read(ckpt_eocd_cd_size_bytes)

# https://en.wikipedia.org/wiki/ZIP_(file_format)#Central_directory_file_header
ckpt_cd_sig = ckpt_cd[0:4]
if ckpt_cd_sig != cd_sig:
    raise Exception("Didn't find the *.ckpt Zip file Central Directory (CD) signature where we expected to. Is the *.ckpt corrupted?")
```

Original message I wrote before I realised I read the error you pasted wrong; in case it's helpful still:

I don't have that particular …

```python
if ckpt_eocd_sig != eocd_sig:
    raise Exception("Didn't find the *.ckpt Zip file End of Central Directory (EOCD) signature where we expected to. Is the *.ckpt corrupted, or does the Zip file have comments in it?")

# NOTE: If the Zip file has comments, then you'd need to seek further back in chunks, and search for the EOCD signature
```

Shouldn't be too hard to implement, as all the bits and pieces are already there. But I'll leave that as an exploration/exercise for the reader to implement :) (aka: feel free to iterate on my proof of concept to make it more robust/cover edge cases/etc) |
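For anyone who wants to pick up that reader's exercise, a minimal sketch of a comment-tolerant EOCD search (not the PoC code above): instead of assuming a fixed 22-byte record at the end of the file, read back far enough to cover the maximum zip comment length and search for the signature.

```python
import os

EOCD_SIG = b"\x50\x4b\x05\x06"  # End of Central Directory signature (0x06054b50, little-endian on disk)
EOCD_MIN_SIZE = 22              # EOCD record size without the optional trailing comment
MAX_COMMENT = 0xFFFF            # zip comments are at most 65535 bytes

def find_eocd_offset(fh):
    """Return the file offset of the EOCD record, tolerating a trailing zip comment."""
    fh.seek(0, os.SEEK_END)
    file_size = fh.tell()
    # Read at most EOCD + max comment length from the end of the file
    read_size = min(file_size, EOCD_MIN_SIZE + MAX_COMMENT)
    fh.seek(file_size - read_size, os.SEEK_SET)
    tail = fh.read(read_size)
    pos = tail.rfind(EOCD_SIG)
    if pos == -1:
        raise ValueError("EOCD signature not found; file is probably not a zip-based *.ckpt")
    return file_size - read_size + pos
```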
My approach is to sum the CRC32s of the files inside the /archive folder and subdirectories (/archive/data). See #4478. To be clear, it's more of a quick and dirty content-hash rather than a true hash, but for my intended purpose it's fast and generates unique values.

The reason being, I would like there to be support for an embedded diffusion-specific metadata file, containing info about the model, most especially the trigger words and descriptions. By limiting the hash function to model-specific files, we can freely inject and modify metadata without worrying about breaking the hash. The hash function can be a sum of CRC32s, or a SHA256 of the concatenation, whichever is more unique. From my limited tests, a CRC32 sum of all model files is good enough to produce unique hashes for most checkpoints, even ones with similar base trainings, and it is as fast as the current method.

The SD community sorely needs to embed metadata into ckpts, with the explosion of different models and their trained triggers. Currently if you download a dreambooth model, there is no way to know the triggers unless you find the original download site or the post on Reddit that the author wrote. Having embedded metadata that documents the triggers and other info will be extremely useful to the community.

I also suggest keeping the existing hash, but adding a hash-v2 in the UI and PNGInfo to avoid breaking existing hashes. Eventually(?) we can phase out the old hash. Here's my implementation of the v2 hash:
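The actual implementation isn't shown here (see #4478); a minimal sketch of that kind of content hash, assuming the zip-based `.ckpt` keeps its payload under an `archive/` prefix and that a hypothetical `metadata.json` should be excluded so it can be edited freely:

```python
import zipfile

def crc_content_hash(path, skip=("metadata.json",)):
    """Quick-and-dirty content hash: sum the CRC32s the zip already stores for the model payload entries."""
    total = 0
    with zipfile.ZipFile(path) as zf:
        for info in zf.infolist():
            # torch .ckpt zips are assumed here to keep their payload under an "archive/" prefix
            if not info.filename.startswith("archive/"):
                continue
            # skip any (future) metadata file so it can be edited without changing the hash
            if info.filename.split("/")[-1] in skip:
                continue
            total = (total + info.CRC) & 0xFFFFFFFF  # keep the sum a 32-bit value
    return f"{total:08x}"
```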
|
I can confirm that @RupertAvery's way of hashing produces unique hashes, and it is very fast. I also like the idea of adding metadata to the ckpt files, as they are proliferating in number and becoming increasingly difficult to organize. Can metadata be added to the textual inversion embeddings, hypernetwork embeddings, and aesthetic gradients too? Or would this have to be a separate file? Should these also have a hash to uniquely identify them? |
@RupertAvery @Jonseed That's basically the same solution proposed earlier in this PR:
And PoC implemented by me above:
@RupertAvery See prior discussions above in this PR about exactly that:
👏🏻👌🏻 Agreed. |
I made a new PR #4546 with the code of this PR and the code suggested by @RupertAvery in #2459 (comment). @RupertAvery just to be clear, I made this PR before your feature request #4478. The proposal of this PR is to solve the issue of different ckpt files having the same hash while keeping backward compatibility, which this PR does. If you wish to add a metadata file inside the ckpt file, feel free to open a PR with that code later. |
So I just tried loading a checkpoint with a file … It's an easy fix, not one I'm entirely happy with.
|
This problem still exists, right? |
With safetensors, the proposed method of hashing the CRCs no longer works, because safetensors aren't zip files. This leads to the question: how can we hash safetensors properly? @JustMaier I've also thought about indexing the Civitai models, and whether it's possible to use just the part of the file necessary to compute the hashes, or maybe ask Civitai to expose the hashes in an API |
So I'm the guy behind Civitai. We don't have the hashes right now because I wasn't sure what was going to be done about this issue and I didn't want to pull everything down for hashing more than once. Since CRC hashing isn't an option for safetensors, I think the standard should be SHA256. If I understand correctly, the concern with that is that it will take longer to compute, right? |
Hi! The problem with computing hashes is that Automatic1111 does it in real time, and that's probably why such a shortcut method was used.

If we're going to add a new format anyway, i.e. safetensors, then I advocate creating a standard container for checkpoints, safetensor OR ckpt, with metadata to say which it is, what the hash is, and a whole lot of space for author information. Just a 2MB header with space for plaintext JSON metadata and a precomputed SHA256 would be good enough. Though, 2MB is probably overkill. Having that empty padded space would allow authors to freely edit their metadata without having to repack the actual weights. Also, the SHA256 is just there for anyone who wants to actually check it.

We just need tools to enable authors to move to the new format, and support from the WebUIs to read it. Like they say, build it, and they will come. It's an additional burden on finetuners, but hopefully dreambooth UIs can integrate this into their process.

It isn't even a new format, it's just an additional header, with the actual data at an offset. I don't know how easy it is for pytorch to load a file from an offset instead of directly, though. If it is possible, we don't even have to break anything; it's just an alternate way of loading checkpoints into memory. Does anybody know if this is feasible?

Also, what's a good extension for this container format? .diffusion? |
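To make the container idea above concrete, a rough sketch of one possible layout (entirely hypothetical, nothing here is an agreed spec): a fixed-size, zero-padded header reserved for plaintext JSON metadata, followed by the unchanged weights payload at a known offset.

```python
import json

HEADER_SIZE = 2 * 1024 * 1024  # hypothetical 2 MB reserved for metadata, as floated above

def write_container(weights_path, out_path, metadata):
    """Write [padded JSON header][original weights] so metadata can be edited without repacking the weights."""
    header = json.dumps(metadata).encode("utf-8")
    if len(header) > HEADER_SIZE:
        raise ValueError("metadata too large for the reserved header")
    with open(weights_path, "rb") as src, open(out_path, "wb") as dst:
        dst.write(header.ljust(HEADER_SIZE, b"\x00"))  # zero-pad so the payload always starts at HEADER_SIZE
        while chunk := src.read(1024 * 1024):
            dst.write(chunk)

def read_metadata(path):
    """Read back just the JSON header, ignoring the padding."""
    with open(path, "rb") as fh:
        return json.loads(fh.read(HEADER_SIZE).rstrip(b"\x00"))
```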
I like the idea of this, there is plenty of metadata that could and should be included in models (think merge tracking etc); the challenge would be propagating the standard. I think this could be helped with easy-to-use python packages that make implementing the standard easier, and additional tools that can be used by end users to add metadata to existing AI art resources (checkpoints, embeds, hypernetworks, etc). Additionally, I'd like to think that if we implemented it as part of Civitai, we could automatically apply this metadata to the 1,284 resources we're currently housing and help start the trend. One bonus of coming up with a metadata standard is that it can continue to be used as new checkpoint formats or other AI art resources are released. For example, LORAs just hit the scene; wouldn't it be great if they were able to include the same metadata format? |
Yea, and we could extend this concept to embeddings, hypernetworks. Different extension perhaps, but having a header there. I'm the author of Diffusion Toolkit, https://github.com/RupertAvery/DiffusionToolkit and I see that as a way to make it accessible to Windows users. I plan to put more checkpoint-related tooling into it anyway. |
I wonder if there is a way to include the metadata without making the files not work in tools that don't support the metadata, so that adoption doesn't have to be blocked by "waiting for my favorite generator to support the format".

Great tool btw. I need to dig into it to see how you've pulled out the metadata from each of those tools. I need the same thing on Civitai; right now it only supports AUTO metadata. Got a file I should look at?

Also, if you're serious about this, we should start a proposal somewhere to see if we can gather any input and get a few tool maintainers on board. |
That was my original goal with ckpts: since they are zip files, they can contain anything without breaking the loading, you just have to tweak some scripts to allow the file we're going to add. The possibility of that went away with safetensors. Another way of course is just to have a .json file next to the ckpt. Instant "metadata". Even then, we still have to somehow add support to GUIs to read the metadata and display it somewhere useful. It would help if we could get someone already familiar with gradio and A1111's GUI code on board, like maybe an extension author. I'm willing to dive in, but I'm a little busy with other things right now.
Everything is in Metadata.cs. If you look at the closed issues, there will be images there with test data.
We'll just have to try to push this forward as much as possible by making a branch and promoting it (like, telling anyone willing to try the fork). Unfortunately, not everyone will be git-savvy or waiting for us to merge the constant influx of commits from the main repo.

By including metadata authoring in Diffusion Toolkit, I hope to generate some hype for it. I could start with storing it in a side-along JSON file, or in the database. It's only useful to Diffusion Toolkit, but at least users can manage their models a bit. I actually thought of loading sample images and other information from Civitai when viewing models in Toolkit, and was hoping to reach out to you for that, but I don't know if that's okay (probably isn't). You have a great site and it really contributes to making models searchable and accessible, and to documenting their information (triggers) where possible.

I have started something by building a file wrapper that SHOULD make it so that, when it gets read by the consumer, it offsets everything. It kind of works; right now I'm testing it with an offset of zero, so I don't have to actually offset everything, just a proof of concept. It works for the zip file loader part (I'm testing it on a ckpt renamed to .diffusion) but as soon as it gets into torch.load, I get this:
This is the wrapper
And this is where I inject it in
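Neither the wrapper nor the injection point is shown above; a minimal sketch of the general idea (a file-like object that shifts every seek/read past a metadata header before the stream is handed to torch.load) might look roughly like this, with HEADER_SIZE purely hypothetical:

```python
import os

class OffsetFile:
    """File-like wrapper that hides a fixed-size header so the consumer only ever sees the payload."""

    def __init__(self, path, offset):
        self._fh = open(path, "rb")
        self._offset = offset
        self._fh.seek(offset)

    def read(self, size=-1):
        return self._fh.read(size)

    def seek(self, pos, whence=os.SEEK_SET):
        # Absolute positions are shifted by the header size; relative/end seeks pass through unchanged
        if whence == os.SEEK_SET:
            return self._fh.seek(pos + self._offset) - self._offset
        return self._fh.seek(pos, whence) - self._offset

    def tell(self):
        return self._fh.tell() - self._offset

    def readable(self):
        return True

    def seekable(self):
        return True

    def close(self):
        self._fh.close()

# In principle torch.load accepts a file-like object, so the idea would be something like:
#   state_dict = torch.load(OffsetFile("model.diffusion", HEADER_SIZE), map_location="cpu")
```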
This is going way off topic of course. We should definitely start an issue or something somewhere else, focused on making a container format and defining its spec. But where, so that we can still get good visibility from other devs? |
See this repository for more information and as a place for discussion on the proposed container format: |
The two hashes are different:
`filename + '.sha256'`
|
All of this PR was made before safetensors existed in this project. This PR isn't going anywhere, because auto didn't show interest in a new hash method, so I won't bother updating it until there is some chance of a new hash method being accepted in this repo. |
This extension is working well |
Issue solved by a95f135 |
Issue
This issue may happen with specific ckpt files when merged with interpolation pairs that add up to 1, as you can see in the screenshot below:
First I thought it might just be the first characters that matched, but it turns out the whole hash is equal, as you can see:
Checking the files proves their contents are different:
So only the bytes being used to create the hash are equal, as I could verify:
Solution
To solve this issue I added an option to create the hash using the entire model and save it to a `.sha256` file next to the `.ckpt` file on the first execution, reading from it on the following executions.

The content of the `.sha256` file follows the default sha256 format, allowing it to be checked using the command `sha256sum -c model.sha256`, and you can even create the `.sha256` file yourself using a command like `sha256sum -b model.ckpt > model.sha256` or equivalent and upload it to the models folder along with the ckpt to avoid the hash generation time on the first start.

This solves the issue of different ckpt files with the same hash without breaking anything for those who don't have this issue; only the first execution is slower, and the following executions are as fast as the default hash code.
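For illustration, a minimal, simplified sketch of that flow (not the actual PR code; the `model.sha256` naming follows the examples above and the helper name is just a placeholder): hash the whole file in chunks on the first run and cache the digest next to the model in sha256sum-compatible format.

```python
import hashlib
import os

def model_sha256(ckpt_path):
    """Return the full-file sha256, reading a cached <model>.sha256 file if present, writing it otherwise."""
    sha_path = os.path.splitext(ckpt_path)[0] + ".sha256"
    if os.path.exists(sha_path):
        with open(sha_path, "r") as fh:
            return fh.read().split()[0]  # first field of the sha256sum-style line
    h = hashlib.sha256()
    with open(ckpt_path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):  # 1 MB chunks to keep memory use flat
            h.update(chunk)
    digest = h.hexdigest()
    with open(sha_path, "w") as fh:
        fh.write(f"{digest} *{os.path.basename(ckpt_path)}\n")  # matches `sha256sum -b` output
    return digest
```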
The new hashes for comparison:
Environment this was tested in