Implement Locking of Text Tower for CLIP Models #523
base: main
Conversation
@rsomani95 thanks for the contribution. For this to work in all current cases, the bulk of the logic needs to be in a free function, because this should work with both CLIP and CustomTextCLIP (w/ builtin text-tower) via the TextTransformer module. So the TextTransformer needs a lock method that calls the free fn, and the CLIP needs the lock_text_tower method you have that also calls the free fn, both passing in the list of modules/parameters... Looking at the vision lock as a template, the grouping strategy used there is a solution to your awkwardness and we should be consistent with that: we grouped the input + abs pos embed + any pre-ln into the first group, each transformer block as another group, lumped the last transformer block with the final LN, and the final projection as its own...
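As a rough sketch of that grouping idea, assuming the free-function name `lock_text_transformer_`, the helper `build_text_groups`, and the exact grouping are illustrative rather than the final open_clip API:

```python
# Sketch only: a free fn both CLIP.lock_text_tower and TextTransformer.lock
# could call, mirroring the grouping used by the vision lock.
from typing import Sequence

from torch import nn


def lock_text_transformer_(groups: Sequence, unlocked_layers: int = 0):
    """Freeze every group, then re-enable grads for the last `unlocked_layers` groups.

    `groups` is ordered input -> output; each entry may be a module, a
    parameter, or a (nested) list of either.
    """
    def _set_requires_grad(x, flag: bool):
        if isinstance(x, (list, tuple)):
            for item in x:
                _set_requires_grad(item, flag)
        elif isinstance(x, nn.Parameter):
            x.requires_grad = flag
        elif isinstance(x, nn.Module):
            for p in x.parameters():
                p.requires_grad = flag

    _set_requires_grad(list(groups), False)  # freeze everything first
    if unlocked_layers:
        _set_requires_grad(list(groups)[-unlocked_layers:], True)  # unlock the tail


def build_text_groups(model):
    # Grouping analogous to the vision lock: embeddings first, one group per
    # transformer block, last block lumped with ln_final, projection on its own.
    return [
        [model.token_embedding, model.positional_embedding],
        *model.transformer.resblocks[:-1],
        [model.transformer.resblocks[-1], model.ln_final],
        model.text_projection,
    ]
```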
- add free fn that can freeze both `TextTransformer` and `CLIP`
@rwightman that freezing code is a lot cleaner; I've updated mine to mimic the vision lock. I've also implemented the free function as I understood it. Let me know if this is in line with what you were describing above.
@dfan with which model do you encounter that error?
@rsomani95 With ViT-H-14-378-quickgelu and ViT-H-14-336 I see a type error. I temporarily resolved it using the following version, but have yet to run a complete training -
I think this should be good to go now. @rohun-tripathi the updated code is a lot simpler and should work functionally just as well. The LayerNorm freezing code I'd copied over naively was incorrect; I've fixed that now too. @dfan @Yangr116 curious if this works with whatever architectures you've tried this with.
@rwightman I couldn't think of a more robust way to extract the LayerNorm layers inside the resblocks: see open_clip/src/open_clip/transformer.py, lines 830 to 839 (at f065ee6).
The assumption here is that the only norm layer being used is LayerNorm. I think, but am not 100% sure, that all existing text models are using LayerNorm?
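The referenced snippet isn't reproduced in this thread; a minimal sketch of what collecting those layers might look like, assuming every norm layer in the text tower is an `nn.LayerNorm` (or a subclass of it), and with a hypothetical helper name:

```python
from torch import nn


def collect_resblock_layer_norms(resblocks):
    """Return every LayerNorm module found inside the transformer blocks."""
    norms = []
    for block in resblocks:
        # nn.Module.modules() walks the block recursively, so nested norms
        # (e.g. ln_1 / ln_2 inside attention/MLP sub-blocks) are picked up too.
        norms.extend(m for m in block.modules() if isinstance(m, nn.LayerNorm))
    return norms
```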
Currently, the `lock-text.*` args only work when using a `HFTextEncoder`. This PR adds a `lock_text_tower` method to the 'native' `CLIP` class as well.

There are two aspects that are different from the `HFTextEncoder` that I'd love feedback on:

1. `embeddings` in this case are two separate layers, whereas HF has just one.
2. `self.text_projection` and `self.ln_final` are passed in as the last two layers respectively. The HF class doesn't have an analogue. So, to unfreeze the last transformer block, you need to pass in `lock-text-unlocked-layers 3`, which feels awkward.
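A minimal usage sketch of the behavior described above. The method name comes from this PR; the keyword argument names are an assumption, mirroring the existing `HFTextEncoder.lock()` signature (`unlocked_layers`, `freeze_layer_norm`):

```python
import open_clip

# ViT-B-32 uses the 'native' CLIP class (builtin text tower), which is the
# case this PR targets.
model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32')

# Under the layer-wise scheme described above, text_projection and ln_final
# count as the last two layers, so 3 is needed to also leave the final
# transformer block trainable.
model.lock_text_tower(unlocked_layers=3, freeze_layer_norm=True)
```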