[Features] Generic clip #286

Merged
merged 54 commits into from
Feb 5, 2023
Conversation

@wanliAlex (Collaborator) commented Jan 25, 2023

  • What kind of change does this PR introduce? (Bug fix, feature, docs update, ...)
    feature

  • What is the current behavior? (You can also link to an open issue here)
    We do not support generic CLIP models.

  • What is the new behavior (if this is a feature change)?
    We now support generic CLIP model input, e.g., custom openai_clip/open_clip weights.
    We also add a minor feature: fp16 versions of the openai_clip models.

  • Does this PR introduce a breaking change? (What changes might users need to make in their application due to this PR?)
    No.

  • Have unit tests been run against this PR? (Has there also been any additional testing?)
    Yes, unit tests have been run.

  • Related Python client changes (link commit/PR here)
    None.

  • Related documentation changes (link commit/PR here)
    In progress.

  • Other information:
    This PR resolves issue #256.

  • Please check if the PR fulfills these requirements

  • The commit message follows our guidelines
  • Tests for the changes have been added (for bug fixes/features)
  • Docs have been added / updated (for bug fixes / features)

@wanliAlex wanliAlex marked this pull request as draft January 25, 2023 01:07
@@ -156,15 +185,195 @@ def __init__(self, model_type: str = "ViT-B/32", device: str = 'cpu', embedding
self.processor = None
self.embedding_dimension = embedding_dim
self.truncate = truncate
self.model_properties = kwargs["model_properties"]
Contributor:
is there any validation on "model_properties"? It might be better to have a default:
self.model_properties = kwargs.get("model_properties", dict()) (check the syntax, but something like that)

Collaborator Author:
This validation step is handled in marqo.s2_inference.s2_inference:

def _validate_model_properties(model_name: str, model_properties: dict) -> dict:

Contributor:
I think this still needs to be implemented?

Collaborator Author:
resolved!

self.tokenizer = clip.tokenize
self.model.eval()

try:
Contributor:
can we use the model_registry entries to check here instead of a try/except?

Collaborator Author:
finished.

I propose we use the key "localpath" or "url" in model_properties to detect whether we use a generic loading.
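A minimal sketch of that detection (the helper name and placement are assumptions, not the merged code):

    def _is_generic_clip_load(model_properties: dict) -> bool:
        # A custom checkpoint is signalled by a local file path or a
        # downloadable url in the user-supplied model_properties.
        return ("localpath" in model_properties) or ("url" in model_properties)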

self.std = self.model_properties.get("std", None)


try:
Contributor:
is there another way to determine if the model belongs to CLIP or open_clip instead of the try/except?

Collaborator Author:
I can't find an explicit way. I merged two loading functions into one that can load both openai_clip and open_clip models.
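A rough sketch of what such a merged loader could look like (function name and structure are illustrative assumptions; the checkpoint is assumed to be already downloaded to a local path):

    import torch
    import clip        # openai/CLIP
    import open_clip   # mlfoundations/open_clip

    def load_custom_clip(name: str, checkpoint_path: str, clip_type: str, device: str = "cpu"):
        # Dispatch on the user-declared "type" field rather than try/except.
        if clip_type == "open_clip":
            # open_clip accepts a local checkpoint path via `pretrained`.
            model, _, preprocess = open_clip.create_model_and_transforms(
                name, pretrained=checkpoint_path, device=device)
        else:
            # openai CLIP checkpoints are TorchScript archives; rebuild a
            # state-dict model from one, as clip.load() does internally.
            jit_archive = torch.jit.load(checkpoint_path, map_location="cpu")
            model = clip.model.build_model(jit_archive.state_dict()).to(device)
            preprocess = None  # the PR derives its transform via _get_transform(...)
        model.eval()
        return model, preprocess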

device_node = [n for n in device_holder.graph.findAllNodes("prim::Constant") if "Device" in repr(n)][-1]

def patch_device(module):
    try:
Contributor:
what causes the runtime error here?

Collaborator Author:
this is not used anymore.

if "value" in node.attributeNames() and str(node["value"]).startswith("cuda"):
node.copyAttributes(device_node)

model.apply(patch_device)
Contributor:
is it possible to use the original load functions instead? https://github.com/openai/CLIP/blob/main/clip/clip.py#L94

Collaborator Author:
this problem is solved by merging into one loading function.

Collaborator Author:
The function in the link can only load openai clip.

return model, _get_transform(model.visual.input_resolution, self.mean, self.std)


def open_clip_load(self):
Contributor:
is it possible to re-use the code between clip/open_clip? A quick look suggests there are a few similar code snippets.

Collaborator Author:
yes, merged into one loading function.



def whitespace_clean(text):
Contributor:
this function appears to be duplicated with one in custom_clip_utils.py

Collaborator Author:
I should delete it.
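For reference, the duplicated helper is the standard CLIP tokenizer utility (as it appears in both openai/CLIP and open_clip):

    import re

    def whitespace_clean(text):
        # Collapse runs of whitespace into single spaces and trim the ends.
        text = re.sub(r"\s+", " ", text)
        text = text.strip()
        return text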

@jn2clark (Contributor):
One more thing, can you provide some examples of how a user would invoke this?

@wanliAlex (Collaborator Author) commented Jan 25, 2023

Examples:

Example 1

Loading an open_clip custom model from a url:

    model_name = "test-model"
    model_properties = {
        "name": "ViT-B-32-quickgelu",
        "dimensions": 512,
        "url": "https://github.com/mlfoundations/open_clip/releases/download/v0.2-weights/vit_b_32-quickgelu-laion400m_e31-d867053b.pt",
        "type": "open_clip",
        "jit": False,
    }

    vectorise(model_name=model_name, content="coco.jpg", model_properties=model_properties)

Example 2

Loading an openai CLIP custom model from a url:

    model_name = "test-model"
    model_properties = {
        "name": "ViT-B/32",
        "dimensions": 512,
        "url": "https://openaipublic.azureedge.net/clip/models/40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt",
        "type": "clip",
    }

    vectorise(model_name=model_name, content="coco.jpg", model_properties=model_properties)
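Example 3 (hypothetical)

A third variant, loading from a local file via the "localpath" key proposed above. This is a sketch of the intended usage; the path is illustrative:

    model_name = "test-model"
    model_properties = {
        "name": "ViT-B/32",
        "dimensions": 512,
        "localpath": "/path/to/ViT-B-32.pt",
        "type": "clip",
    }

    vectorise(model_name=model_name, content="coco.jpg", model_properties=model_properties)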

required_keys = ["name", "dimensions"]
for key in required_keys:
    if key not in model_properties:
        raise InvalidModelPropertiesError(f"model_properties has missing key '{key}'. ")
Collaborator:
Can you make the error message more helpful? Including a link would also be helpful.

Collaborator Author:
revised
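A sketch of what the revised check might look like (the exact wording and any documentation link in the merged code may differ):

    required_keys = ["name", "dimensions"]
    for key in required_keys:
        if key not in model_properties:
            raise InvalidModelPropertiesError(
                f"model_properties is missing the required key '{key}'. "
                f"Generic CLIP models require the keys {required_keys}. "
                "Please refer to Marqo's documentation on generic models for examples.")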

'''
buffer_size = 8192
if not cache_dir:
    cache_dir = os.path.expanduser("~/.cache/clip")
Collaborator:
Could you make the cache consistent with the rest of our caches ('./cache/...').

See here:

https://github.com/marqo-ai/marqo/blob/e80e53b405ce524ccdb1f77c227535771c73fe43/src/marqo/s2_inference/configs.py

Collaborator Author:
Done. Also updated this for the other CLIP models.
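A sketch of the requested change, assuming a shared constant following the './cache/...' convention in configs.py (the constant name is an assumption):

    import os

    # Hypothetical project-wide cache location, mirroring configs.py's
    # './cache/...' convention rather than the user's home directory.
    CLIP_CACHE_PATH = os.path.join("./cache", "clip")

    if not cache_dir:
        cache_dir = CLIP_CACHE_PATH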

@wanliAlex wanliAlex left a comment (Collaborator Author):
test

@wanliAlex wanliAlex temporarily deployed to marqo-test-suite February 3, 2023 — with GitHub Actions