Huggingface agent #2599

Closed · wants to merge 27 commits

Conversation

@whiskyboy (Collaborator) commented May 5, 2024

Why are these changes needed?

This PR introduces a new agent, HuggingFaceAgent, which can connect to models on the Hugging Face Hub to provide several multimodal capabilities.

The agent essentially pairs an assistant agent with a user-proxy agent, both registered with the huggingface-hub model capabilities. Users can leverage its multimodal capabilities seamlessly, without manually registering toolkits for execution.

Some key changes (a rough usage sketch follows this list):

  1. Added a HuggingFaceClient class in autogen/agentchat/contrib/huggingface_utils.py: this class simplifies calling Hugging Face models locally or remotely.
  2. Added a HuggingFaceAgent class in autogen/agentchat/contrib/huggingface_agent.py: this agent uses HuggingFaceClient to provide multimodal capabilities.
  3. Added a HuggingFaceImageGenerator class in autogen/agentchat/contrib/capabilities/generate_images.py: this class enables text-based LLMs to generate images using HuggingFaceClient.
  4. Added notebook samples to demonstrate how these new classes work.
  5. Fixed some bugs.

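For reviewers unfamiliar with the new classes, here is a rough usage sketch. It is illustrative only: the constructor arguments shown (name, llm_config, human_input_mode, etc.) are assumptions made for this sketch, not the PR's confirmed API.

```python
# Illustrative only: a possible way to drive the new HuggingFaceAgent from a
# user-proxy agent. Constructor arguments below are assumptions, not the PR's
# confirmed API.
from autogen import UserProxyAgent
from autogen.agentchat.contrib.huggingface_agent import HuggingFaceAgent

hf_agent = HuggingFaceAgent(
    name="hf_assistant",
    llm_config={"config_list": [{"model": "gpt-4", "api_key": "sk-..."}]},  # assumed parameter
)
user = UserProxyAgent(
    name="user",
    human_input_mode="NEVER",
    code_execution_config=False,
)
user.initiate_chat(
    hf_agent,
    message="Generate an image of a cat playing chess, then describe it back to me.",
)
```
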
Related issue number

The second approach mentioned in #2577

Checks

@codecov-commenter commented May 5, 2024

Codecov Report

Attention: Patch coverage is 4.08922% with 258 lines in your changes missing coverage. Please review.

Project coverage is 19.01%. Comparing base (84c7c24) to head (9155090).
Report is 288 commits behind head on 0.2.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| autogen/oai/huggingface.py | 3.90% | 123 Missing ⚠️ |
| autogen/agentchat/contrib/huggingface_agent.py | 0.00% | 109 Missing ⚠️ |
| .../agentchat/contrib/capabilities/generate_images.py | 0.00% | 20 Missing ⚠️ |
| autogen/oai/client.py | 40.00% | 5 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@             Coverage Diff             @@
##              0.2    #2599       +/-   ##
===========================================
- Coverage   33.12%   19.01%   -14.12%     
===========================================
  Files          88       96        +8     
  Lines        9518     9868      +350     
  Branches     2037     2253      +216     
===========================================
- Hits         3153     1876     -1277     
- Misses       6096     7805     +1709     
+ Partials      269      187       -82     
Flag Coverage Δ
unittests 18.97% <4.08%> (-14.16%) ⬇️


@sonichi requested a review from BeibinLi May 5, 2024 17:07
@sonichi added the multimodal (language + vision, speech etc. integration) and models (pertains to using alternate, non-GPT models, e.g., local models, llama, etc.) labels May 5, 2024
@WaelKarkoub (Contributor) commented:

@whiskyboy thanks for the PR! I had a couple of design questions and wanted your opinion on them.

Autogen has an image generation capability, which allows anyone to add text-to-image capabilities to any LLM.

class ImageGeneration(AgentCapability):

What do you think about implementing a new custom ImageGenerator that uses Hugging Face's APIs, as opposed to creating a new agent type? We have a DALL-E image generator implemented for reference.

For image-to-text, we also have a capability called VisionCapability. @BeibinLi has more information on the design choices for that capability but I just wanted to bring it up for awareness.

class VisionCapability(AgentCapability):
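For context, the existing capability is typically attached to an agent roughly as follows. This is a sketch; treat the exact parameter names as approximate rather than authoritative.

```python
# Sketch of wiring the existing ImageGeneration capability with the DALL-E
# generator mentioned above; parameter names are approximate.
from autogen import ConversableAgent
from autogen.agentchat.contrib.capabilities.generate_images import (
    DalleImageGenerator,
    ImageGeneration,
)

llm_config = {"config_list": [{"model": "gpt-4", "api_key": "sk-..."}]}
dalle_config = {"config_list": [{"model": "dall-e-3", "api_key": "sk-..."}]}

artist = ConversableAgent(name="artist", llm_config=llm_config)
image_gen = ImageGeneration(image_generator=DalleImageGenerator(llm_config=dalle_config))
image_gen.add_to_agent(artist)  # the agent can now return generated images in its replies
```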

@whiskyboy (Collaborator, Author) commented:

@WaelKarkoub Thanks for your comment!
Yes, and in fact I drew a lot of inspiration from the design of the two capabilities you mentioned above, as well as from the MultimodalConversableAgent and LLaVAAgent, during development. Here are my thoughts:

  1. Can we achieve the same functionality within the current multimodal capability implementations?
    Certainly, we could implement a custom ImageGenerator or a custom custom_caption_func to realize text-to-image and image-to-text capabilities using Hugging Face's APIs (a minimal sketch follows this list). However, Hugging Face offers the potential for many other multimodal capabilities, such as image-to-image, audio-to-audio, etc., which go beyond the current implementations. (A full list can be found here.) For now, this draft PR serves only as a PoC to show how a Hugging Face agent works. Once we align on the design, I'll proceed with implementing additional capabilities.
  2. Should we add a new agent type, or add new multimodal capabilities that leverage Hugging Face multimodal models?
    Both designs make sense to me. Introducing a new agent type makes it easy to cover a diverse range of multimodal capabilities for general-purpose use, while registering a new capability is better suited to a specific task. (We could also have one general capability, or register multiple capabilities on a single agent, so I'm flexible and open to either approach.)
  3. Do we really need built-in support for Hugging Face multimodal models?
    The idea was inspired by Transformers Agents and JARVIS. It's appealing (to me at least) to have a non-OpenAI, out-of-the-box solution for adding multimodal capabilities to a text-only LLM in autogen. Hugging Face stands out as a suitable choice due to its diverse range of multimodal models, spanning general-purpose to domain-specific areas. Additionally, it offers a cost-effective solution.
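
To make point 1 concrete, a minimal sketch of a Hugging Face-backed generator shaped to fit the ImageGenerator protocol might look like the following. The class name, default model, and the exact protocol methods (generate_image / cache_key) are assumptions for illustration, not code from this PR.

```python
# Minimal sketch (not from this PR): a text-to-image generator backed by the
# Hugging Face Inference API, shaped to fit autogen's ImageGenerator protocol.
from PIL.Image import Image
from huggingface_hub import InferenceClient


class HFTextToImageGenerator:
    def __init__(self, token: str, model: str = "stabilityai/stable-diffusion-2-1"):
        self._client = InferenceClient(token=token)
        self._model = model

    def generate_image(self, prompt: str) -> Image:
        # The hosted endpoint returns a PIL image for text-to-image models.
        return self._client.text_to_image(prompt, model=self._model)

    def cache_key(self, prompt: str) -> str:
        # A stable key so an image-generation capability could cache results.
        return f"{self._model}:{prompt}"
```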

@WaelKarkoub (Contributor) commented:

@whiskyboy This is very cool and I appreciate your efforts! Your reasoning fits well with my current thinking. Both approaches could benefit the autogen community and could coexist: we can have standalone Hugging Face conversable agents as well as Hugging Face image generators, audio generators, etc.

I look at Autogen as a lego world where users can mix and match different useful tools (lego pieces), and the tools you've developed are valuable and versatile enough to be applicable across many areas (e.g., agent capabilities). For a concrete example, what do you think about breaking down the text-to-image functionality and implementing it as an ImageGenerator that HuggingFaceAgent could also utilize? The HuggingFaceAgent wouldn't implement it as a capability but could directly use this newly decoupled logic. We could apply a similar strategy to other modalities as well.

One last question, is the image-to-image capability the same as image editing? If so, I'm considering improving the image generator capability to allow for this.

@whiskyboy (Collaborator, Author) commented May 6, 2024

@WaelKarkoub Glad to know we are working towards the same goal!

what do you think about breaking down the text-to-image functionality and implementing it as an ImageGenerator that HuggingFaceAgent could also utilize?

Sounds like a versatile Lego block that could be used by both standalone agents and agent capabilities? I think it's a good idea, as it would improve reusability and make the code more readable and maintainable.

is the image-to-image capability the same as image editing?

Yes, some typical user scenarios include style transfer, image inpainting, etc. For instance, the timbrooks/instruct-pix2pix model can transform a dog in one image into a cat. These models are usually diffusion models that accept a source image and a text prompt as input.
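
As a rough illustration of that scenario (not part of this PR), an image edit could be requested through huggingface_hub; whether the hosted Inference API actually serves timbrooks/instruct-pix2pix is an assumption.

```python
# Illustrative sketch: image editing (image-to-image) via the Hugging Face
# Inference API. Model availability on the hosted endpoint is an assumption.
from huggingface_hub import InferenceClient

client = InferenceClient(token="hf_...")
edited = client.image_to_image(
    "dog.png",                          # source image: path, raw bytes, or PIL image
    prompt="turn the dog into a cat",
    model="timbrooks/instruct-pix2pix",
)
edited.save("dog_as_cat.png")
```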

gitguardian bot commented May 27, 2024

⚠️ GitGuardian has uncovered 3 secrets following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

Since your pull request originates from a forked repository, GitGuardian is not able to associate the secrets uncovered with secret incidents on your GitGuardian dashboard.
Skipping this check run and merging your pull request will create secret incidents on your GitGuardian dashboard.

🔎 Detected hardcoded secrets in your pull request
| GitGuardian id | GitGuardian status | Secret | Commit | Filename |
|---|---|---|---|---|
| 10493810 | Triggered | Generic Password | d422c63 | notebook/agentchat_pgvector_RetrieveChat.ipynb |
| 10493810 | Triggered | Generic Password | d422c63 | notebook/agentchat_pgvector_RetrieveChat.ipynb |
| 10493810 | Triggered | Generic Password | d422c63 | notebook/agentchat_pgvector_RetrieveChat.ipynb |
🛠 Guidelines to remediate hardcoded secrets
  1. Understand the implications of revoking this secret by investigating where it is used in your code.
  2. Replace and store your secrets safely; learn the best practices here.
  3. Revoke and rotate these secrets.
  4. If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.


@WaelKarkoub (Contributor) commented:

#2836: see if this PR could make sense for you as well; we want to add multimodality support for all agents, and this is the first step.

@whiskyboy (Collaborator, Author) commented Jun 3, 2024

#2836: see if this PR could make sense for you as well; we want to add multimodality support for all agents, and this is the first step.

Loving the design!

@whiskyboy (Collaborator, Author) commented Jun 17, 2024

@WaelKarkoub do you have any more comments on this PR?

@ekzhu changed the base branch from main to 0.2 October 2, 2024 18:29
@jackgerrits added the 0.2 label (Issues which are related to the pre 0.4 codebase) Oct 4, 2024
@rysweet added the awaiting-op-response label (Issue or PR has been triaged or responded to and is now awaiting a reply from the original poster) Oct 10, 2024
@rysweet (Collaborator) commented Oct 12, 2024

Hi @whiskyboy, thanks so much for this PR. We've rebased it onto the 0.2 branch. Please consider also updating it for 0.4 if you'd like, or resolving the conflicts with 0.2, and we will get someone to review it further.

@rysweet (Collaborator) commented Oct 18, 2024

Closing as stale; please reopen if you would like to update.

@rysweet closed this Oct 18, 2024
Labels
- 0.2: Issues which are related to the pre 0.4 codebase
- awaiting-op-response: Issue or PR has been triaged or responded to and is now awaiting a reply from the original poster
- models: Pertains to using alternate, non-GPT, models (e.g., local models, llama, etc.)
- multimodal: language + vision, speech etc.
6 participants