Make model key assignment deterministic #5792

Merged: 10 commits merged into main from feat/deterministic-model-keys on Mar 3, 2024


lstein
Collaborator

@lstein lstein commented Feb 24, 2024

What type of PR is this? (check all applicable)

  • Refactor
  • Feature
  • Bug Fix
  • Optimization
  • Documentation Update
  • Community Node Submission

Have you discussed this change with the InvokeAI team?

  • Yes
  • No, because:

Have you updated all relevant documentation?

  • Yes
  • No

Description

Previously, when a model was installed, it was assigned a random database key. This caused issues because the key would change whenever the model was reinstalled, which also made image metadata irreproducible. This PR changes the behavior so that keys are assigned deterministically based on the model contents:

  • .safetensors, .ckpt and other single file models are hashed with sha1. This is compatible with A1111's model hashes.
  • The contents of diffusers directories are hashed using imohash (faster, but nonstandard)

Related Tickets & Documents

See discord: https://discord.com/channels/1020123559063990373/1149513647022948483/1210441942249377792

  • Related Issue #
  • Closes #

QA Instructions, Screenshots, Recordings

Try installing and uninstalling a model from the command line using invokeai-model-install --add <url or repoid>, followed by invokeai-model-install --delete <name of model>. Both safetensors URLs and HF diffusers should get the same key each time.

Merge Plan

Can merge when approved.

Added/updated tests?

  • Yes - changed length of expected hash for embedding files.
  • No

[optional] Are there any post deployment tasks we need to perform?

@github-actions github-actions bot added the python (PRs that change python files), backend (PRs that change backend files), services (PRs that change app services) and PythonTests labels Feb 24, 2024
@lstein lstein force-pushed the feat/deterministic-model-keys branch from 2f9f698 to 225a41d Compare February 24, 2024 16:23
Collaborator

@psychedelicious psychedelicious left a comment


There's a problem with the SHA1 implementation. It uses a block size when reading the file, causing the hash to be incorrect.

I was curious so I whipped up a benchmark of file hashing in python: https://github.com/psychedelicious/hash_bench

SHA1 with block size is by far the fastest, but BLAKE3 is still way faster than any other correct algorithm. I think we should use this for single-file models.

CivitAI provides BLAKE3 hashes, and you can query for a model on their API using it: https://civitai.com/api/v1/model-versions/by-hash/39d4e86885924311d47475838b8e10edb7a33c780934e9fc9d38b80c27cb658e

(that's DreamShaper XL something or other - used in my tests)

There are a few ways to implement BLAKE3, see my benchmark repo for the fastest method, which parallelizes hashing and uses memory mapping (I don't really understand what this means but it makes it super fast).

@RyanJDick
Collaborator

From a general DB design perspective, I'd be inclined to keep the random primary key and add the hash as another column on the table. (Which I think is what we had discussed when we originally decided to use a random key.)

This would be better if we think there's any chance that the definition of the hash key will change in the future, or that its uniqueness constraint will be dropped. Example situations:

  • If we ever decide to change the hashing function that we use (e.g. for performance, we discover a bug, to map very similar models to the same 'hash', etc.), the migration will be simpler.
  • If we ever decide that we want to track deleted models, having a unique primary key is more practical.
  • If, in the future, we want to run optimizations on models (e.g. TensorRT) and store the optimized artifacts, having a single primary hash key probably won't map well to what we are trying to represent.
  • And probably many more situations that are hard to anticipate...

Up to you if you think the added effort now is worth it for the future-proofing.
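As a rough illustration of the schema suggested above (random primary key, content hash in its own indexed column), here is a minimal sketch using Python's built-in sqlite3. The table name, column names and helper functions are hypothetical and are not taken from the InvokeAI codebase.

```python
import sqlite3
import uuid


def create_schema(conn):
    # Hypothetical schema: a random primary key plus a separate,
    # non-unique hash column that can be indexed and changed later.
    conn.execute(
        """
        CREATE TABLE models (
            id   TEXT PRIMARY KEY,
            name TEXT NOT NULL,
            hash TEXT NOT NULL
        )
        """
    )
    conn.execute("CREATE INDEX idx_models_hash ON models (hash)")


def install_model(conn, name, content_hash):
    # The key is random, so it differs on every install.
    model_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO models (id, name, hash) VALUES (?, ?, ?)",
        (model_id, name, content_hash),
    )
    return model_id


def find_by_hash(conn, content_hash):
    # The hash is the stable reference used to look a model up again.
    rows = conn.execute(
        "SELECT id FROM models WHERE hash = ? ORDER BY id", (content_hash,)
    ).fetchall()
    return [row[0] for row in rows]
```

With this layout, reinstalling a model yields a new id, but the same hash still finds it, and changing the hash function later only touches one non-key column.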

@psychedelicious
Collaborator

The PK can be anything but we need to be able to reference models using a stable, deterministic identifier.

There's no point in having model metadata if it isn't a stable reference to the same model.

Using a cryptographic hash also means metadata is useful between users.

Also, I think we should consider dropping imohash and instead use b3 to hash the diffusers weights. Iterate over all files and update the hash as you go. No need to rely on the imohash library.
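A minimal sketch of the "iterate over all files and update the hash as you go" idea. It uses hashlib's blake2b as a stand-in because BLAKE3 is not in the Python standard library; the function name and the sorted traversal order are assumptions made here so that the result is reproducible, not details from the PR.

```python
import hashlib
from pathlib import Path


def hash_directory(root, algorithm="blake2b", chunk_size=1024 * 1024):
    """Feed the contents of every file under root into a single hash."""
    h = hashlib.new(algorithm)
    root = Path(root)
    # Sort the paths so the traversal order (and so the digest) is
    # the same on every run and every platform.
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        with open(path, "rb") as f:
            # Read in chunks so large weight files are never fully in memory.
            while chunk := f.read(chunk_size):
                h.update(chunk)
    return h.hexdigest()
```

Note that this sketch hashes file contents only. Whether file names should also be mixed into the digest is a design choice that the discussion above does not settle.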

@lstein
Collaborator Author

lstein commented Feb 27, 2024

Huh. I didn't realize that I was hashing incorrectly. The web is filled with misleading info, I guess.

The memoryview() method applied to sha256 gives the same answer as sha256sum on the command line and is twice as fast as the linux command-line tool:

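import hashlib

def sha256sum(filename):
    h = hashlib.sha256()
    b = bytearray(128 * 1024)  # reusable 128 KiB read buffer
    mv = memoryview(b)         # view into the buffer, so slicing does not copy
    with open(filename, 'rb', buffering=0) as f:
        # readinto() returns the number of bytes read, and 0 at end of file
        for n in iter(lambda: f.readinto(mv), 0):
            h.update(mv[:n])
    return h.hexdigest()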

Python:

time python hash.py ais-stcks-sdxl.safetensors
cf729c0896c2bd69d2a9e5687f5ebe0b44d29879529e2271f8f2d64550485608

real	0m0.591s
user	0m0.554s
sys	0m0.037s

Command line tool:

time sha256sum ais-stcks-sdxl.safetensors 
cf729c0896c2bd69d2a9e5687f5ebe0b44d29879529e2271f8f2d64550485608  ais-stcks-sdxl.safetensors

real	0m1.080s
user	0m1.040s
sys	0m0.041s

The only reason I chose sha1 in the first place is that it produces shorter hashes.

@lstein
Collaborator Author

lstein commented Feb 27, 2024

Oh, just saw the blake3 benchmark results. That is very fast. Why don't we just go with that and stick with it?

@lstein
Collaborator Author

lstein commented Feb 27, 2024

From a general DB design perspective, I'd be inclined to keep the random primary key and add the hash as another column on the table. (Which I think is what we had discussed when we originally decided to use a random key.)

We are trying to satisfy multiple use cases, ordered in decreasing level of priority.

  1. Models should have stable identifiers that don't change when they are uninstalled and reinstalled.
  2. Users should be able to use the metadata embedded in generated images to reproduce the image.
  3. If a model is transformed from .safetensors to diffusers versions, its identifier shouldn't change.
  4. We want to be able to easily identify a model and recognize what it is. The UI will hide the model ID almost all the time, but there is a use case in which the user is viewing the raw model metadata and trying to figure out what model was used.
  5. We want to check that a downloaded model file or directory has arrived intact.
  6. We want to know if model A on one user's computer is the same as model B on another user's.

With respect to (1), we need to adopt deterministic model identifiers rather than randomly-assigned ones. This seems obvious now, but recall that when we originally discussed the MM2 redesign in response to Spencer's RFC last summer we had a consensus that the identifiers should be random.

Thoughts:

  • Integrity checking:
    • Use cases (5) and (6) are independent of how the models are named and shouldn't be conflated with (1) and (2). We hash the models correctly and store that info for integrity checking and comparison. We don't need to use a hash for the ID.
    • Note that we already store the file hash as an element in the model's config. We just need to adopt a hash algorithm that is consistent with the checksums provided by Civitai and other repos.
    • blake3 is nice and fast. My only reservation is that it is not widely known and I had to search around a bit to find a desktop tool that implemented it.
  • Identifying the model reproducibly:
    • Use cases (1) and (2) are satisfied by any algorithm that deterministically assigns the model ID.
    • (4) is hard. We could discuss reverting to using the base/type/name as the identifier, but we know this leads to confusion when the same model is downloaded from different locations under slightly different names. The one advantage this has is that closely related models, such as the fp16 and fp32 versions, will have similar IDs.
    • Another option is to use the source path or URL, but this is brittle.
    • The hash solution is a good one. The one thing I don't like about it is how ungainly the long length is (for example, it makes it hard to interact with the SQL database). One solution is to shorten the ID by truncating the hex digest to the first 12 characters and still have minimal risk of collisions (12 hex characters encode 48 bits, so any given pair of models collides with a chance of about 1 in 281 trillion).
  • Managing format changes:
    • The current code calculates a hash when the model is first installed and then uses this hash as the ID and stores it in the config under current_hash. Later, if the model is converted into diffusers, the ID remains the same, but the hash value is moved to an original_hash field, the model is re-hashed, and the new hash replaces the value in the current_hash field.
    • This solves the safetensors->diffusers conversion issue, but doesn't help resolve the issue of the user having the same model in two different original formats - say SDXL-base.safetensors downloaded from Civitai and SDXL-base diffusers downloaded from HuggingFace. Do we want to try to address this? What about different floating point precisions?
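The bookkeeping described for format conversion can be sketched as follows. The dataclass and function names are hypothetical; only the key, current_hash and original_hash field names come from the discussion. Keeping the earliest original_hash across repeated conversions is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelRecord:
    key: str
    current_hash: Optional[str] = None
    original_hash: Optional[str] = None


def record_conversion(record, new_hash):
    """Update the hashes after a format conversion; the key is untouched."""
    # Assumption: remember only the hash from the first install, even if
    # the model is converted more than once.
    if record.original_hash is None:
        record.original_hash = record.current_hash
    record.current_hash = new_hash
    return record
```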

Overall, I think my preference would be to use blake3 for hashing (both safetensors and diffusers directory recursion), to store the hashes in the config for integrity checking, and to use a truncated version of the hash for the model ID. We would also want to provide the user with a UI display element that shows the model ID, its name, its source, its format and its hash, which would help them match models that have been converted or renamed. Finally, maybe the image generation metadata display could be modified to show the model name as well as its ID?
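To put a number on the collision risk of a truncated ID, here is a small sketch using the standard birthday-bound approximation p ≈ 1 − exp(−n(n−1)/(2N)), where N is the number of possible IDs. The function name is made up for illustration, and the estimate assumes the hash output is uniformly distributed.

```python
import math


def collision_probability(num_models, hex_chars=12):
    """Approximate chance that any two of num_models share a truncated ID."""
    space = 16 ** hex_chars  # 12 hex characters -> 2**48 possible IDs
    pairs = num_models * (num_models - 1) / 2
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
    return -math.expm1(-pairs / space)
```

For a library of 10,000 models and a 12-character ID this gives an estimate of roughly 2 in 10 million, so truncation looks safe at any realistic library size.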

@lstein
Collaborator Author

lstein commented Feb 27, 2024

The other thing we should discuss before merging this PR is how hashes are represented in the model config. The pydantic model is currently:

from typing import Optional

from pydantic import BaseModel, Field

class ModelConfigBase(BaseModel):
    name: str = Field(description="model name")
    key: str = Field(description="model key")
    original_hash: Optional[str] = Field(
        description="original fasthash of model contents", default=None
    )
    current_hash: Optional[str] = Field(
        description="current fasthash of model contents", default=None
    )
    # [irrelevant fields removed]

Civitai computes multiple hashes, and maybe we should allow for similar flexibility in the future. One approach is:

original_hashes: Dict[HashAlgorithm, str] = Field(description="dict of hash algorithms and their resulting hashes")

This would let us apply multiple hashes to support other model sources. Downside is that it would necessitate a database migration.

Another approach would simply be to adopt a convention of prefixing the hash with its algorithm name, as in blake3:abcd1234efff0. This wouldn't need a database migration and would give us the flexibility to change the hash algorithm in the future.
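A minimal sketch of the prefix convention. The helper names are hypothetical, and treating an unprefixed string as a legacy hash with an "unknown" algorithm is an assumption of this sketch rather than something decided in the thread.

```python
def format_hash(algorithm, digest):
    """Build a prefixed hash string such as 'blake3:abcd1234efff0'."""
    return f"{algorithm}:{digest}"


def parse_hash(value, default_algorithm="unknown"):
    """Split a prefixed hash into (algorithm, digest)."""
    algorithm, sep, digest = value.partition(":")
    if not sep:
        # No prefix: assume a legacy hash stored before the convention.
        return default_algorithm, value
    return algorithm, digest
```

Because the value is still a single string, the existing original_hash and current_hash columns can hold it unchanged.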

Lincoln Stein and others added 7 commits March 3, 2024 11:03
- When installing, model keys are now calculated from the model contents.
- .safetensors, .ckpt and other single file models are hashed with sha1
- The contents of diffusers directories are hashed using imohash (faster)

fixup yaml->sql db migration script to assign deterministic key

- this commit also detects and assigns the correct image encoder for
  ip adapter models.
- Some algos are slow, so it is now just called ModelHash
- Added all hashlib algos, plus BLAKE3 and the fast (but incorrect) SHA1 algo
@psychedelicious psychedelicious changed the base branch from next to main March 3, 2024 00:05
@github-actions github-actions bot added the python-tests (PRs that change python tests) and python-deps (PRs that change python dependencies) labels Mar 3, 2024
This changes the functionality of this PR to only use the updated hashing for model hashes with a UUID for the key.
- Use memory view for hashlib algorithms (closer to python 3.11's filehash API in hashlib)
- Remove `sha1_fast` (realized it doesn't even hash the whole file, it just does the first block)
- Add support for custom file filters
- Update docstrings
- Update tests
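The `sha1_fast` problem mentioned in the commit notes above (only the first block was hashed) is easy to reproduce. The sketch below is a hypothetical reconstruction of that kind of bug next to a correct chunked read; it is not the code that was removed from the PR.

```python
import hashlib


def sha1_first_block_only(path, block_size=65536):
    # Buggy pattern: a single read() call sees at most block_size bytes,
    # so the rest of the file never reaches the hash.
    with open(path, "rb") as f:
        return hashlib.sha1(f.read(block_size)).hexdigest()


def sha1_chunked(path, block_size=65536):
    # Correct pattern: keep reading until read() returns b"" and feed
    # every chunk to the same hash object.
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(block_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

Reading in blocks is not the problem in itself: the chunked version gives the same digest as hashing the whole file at once, while the first-block version only matches for files no larger than one block.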
@psychedelicious psychedelicious enabled auto-merge (rebase) March 3, 2024 03:24
@psychedelicious
Collaborator

I've been moving back and forth between this PR and #5846, in which the key will be a UUID. I've retained the improvements to the model hash class in this PR, but made the key a UUID.

My recent commits thus change this PR from "Make model key assignment deterministic" to "Improved model hashing".

@psychedelicious psychedelicious self-requested a review March 3, 2024 03:28
@psychedelicious psychedelicious merged commit 2f372d9 into main Mar 3, 2024
14 checks passed
@psychedelicious psychedelicious deleted the feat/deterministic-model-keys branch March 3, 2024 03:32
Labels
backend PRs that change backend files python PRs that change python files python-deps PRs that change python dependencies python-tests PRs that change python tests Root services PRs that change app services

4 participants