Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(datasets): Add limited langchain support for Anthropic, Cohere, and OpenAI models #434

Merged
merged 23 commits into from
Jun 3, 2024

Conversation

ianwhale
Copy link
Contributor

@ianwhale ianwhale commented Nov 16, 2023

Description

Adds limited support for langchain models.

This PR is a rough starting point for loading langchain API-based models.

The big issue here is langchain's model catalog. See the list here (just for chat models).

There's no way anyone could implement and maintain all of these.

Even if that was desirable, we can see from the CohereDataset example that there are going to be lots of details along the way that will make this task difficult.

Would love to see what the team thinks and if this is worth pushing forward!

Development notes

Adds four datasets for interacting with langchain models.

Notes from @merelcht after updating the datasets

  • Updated the imports and dependencies to the latest
  • I converted the CohereDataset to a ChatCohereDataset because I kept getting an error about the number of arguments being passed on. I ran into the same issue even just running the examples in https://python.langchain.com/v0.1/docs/integrations/llms/cohere/
  • Tested all datasets with the examples provided in the doc string

Checklist

  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the relevant RELEASE.md file
  • Added tests to cover my changes

Signed-off-by: Ian Whalen <ianpatrickwhalen@gmail.com>
@ianwhale ianwhale changed the title Add limited langchain support for Anthropic, Cohere, and OpenAI models feat(datasets): Add limited langchain support for Anthropic, Cohere, and OpenAI models Nov 16, 2023
Copy link
Member

@astrojuanlu astrojuanlu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for contributing this @ianwhale, I think it would be amazing to have these in! Have you used them already in any proof of concept? I see you have written YAML examples in the docstrings but we don't have much experience with langchain, so it would help us to have some Python examples as well.

Apart from that, what's missing from your side to declare this ready for a code review?

@ianwhale
Copy link
Contributor Author

Hey again @astrojuanlu! Excuse my slow reply, I was out for thanksgiving.

I did use this in a PoC. However, I only ever used the YAML api.

I'll push some python API examples.

Signed-off-by: Ian Whalen <ianpatrickwhalen@gmail.com>
Signed-off-by: Ian Whalen <ianpatrickwhalen@gmail.com>
@noklam
Copy link
Contributor

noklam commented Apr 30, 2024

Depending on #629, this may need to move to a contribution folder but I think this is mostly ready

@merelcht merelcht self-assigned this May 14, 2024
@merelcht
Copy link
Member

Hi @ianwhale, thanks so much for your patience with this PR! We're about to launch our new experimental dataset contribution model, which basically means you can contribute datasets that are more experimental and don't have to have full test coverage etc here https://github.com/kedro-org/kedro-plugins/tree/main/kedro-datasets/kedro_datasets_experimental.

I think this PR with datasets would be a perfect first candidate to go into kedro_datasets_experimental. I don't think there's much else you need to do, other than move it to that directory.

merelcht and others added 13 commits May 21, 2024 14:59
# Conflicts:
#	kedro-datasets/setup.py
Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
Signed-off-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>
Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
@merelcht merelcht marked this pull request as ready for review May 30, 2024 09:56
@merelcht merelcht requested a review from ElenaKhaustova May 30, 2024 09:58
Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
@merelcht merelcht requested a review from astrojuanlu May 30, 2024 12:21
Copy link
Contributor

@ElenaKhaustova ElenaKhaustova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, @merelcht and @ianwhale, nice work! 🚀

Approving the PR with some minor comments. I've tested all datasets on my end and they work as expected.

A couple of thoughts that relate to the topic and we can consider them in future:

  • When we worked with the langchain we found it convenient to work with chains - that combine llm and promt and provide a standardised interface to call the model with run-time parameters (prompt placeholders). So one can use different llms with the same interface. Example using latest langchain API: https://python.langchain.com/v0.1/docs/integrations/chat/anthropic/
  • We also came to dynamic model initialisation, in our case it can help users to switch between different models without need to add extra datasets (OpenAI, Cohere, Azure, etc) with just one LangchainDataset. For example the catalog.yaml can look like that:
gpt_3_5_turbo:
   type: langchain.DataSet
   model_type: langchain_openai.ChatOpenAI
   kwargs:
     model: "gpt-3.5-turbo"
     temperature: 0.0
   credentials: openai
         
claude_instant_1:
   type: langchain.Dataset
   model_type: langchain_anthropic.ChatAnthropic
   kwargs:
     model: "claude-instant-1"
     temperature: 0.0
   credentials: anthropic

merelcht and others added 5 commits June 3, 2024 10:12
Co-authored-by: ElenaKhaustova <157851531+ElenaKhaustova@users.noreply.github.com>
Signed-off-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>
Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
…c.py

Co-authored-by: ElenaKhaustova <157851531+ElenaKhaustova@users.noreply.github.com>
Signed-off-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>
Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
@merelcht
Copy link
Member

merelcht commented Jun 3, 2024

A couple of thoughts that relate to the topic and we can consider them in future:

  • When we worked with the langchain we found it convenient to work with chains - that combine llm and promt and provide a standardised interface to call the model with run-time parameters (prompt placeholders). So one can use different llms with the same interface. Example using latest langchain API: https://python.langchain.com/v0.1/docs/integrations/chat/anthropic/
  • We also came to dynamic model initialisation, in our case it can help users to switch between different models without need to add extra datasets (OpenAI, Cohere, Azure, etc) with just one LangchainDataset.

Thanks @ElenaKhaustova for reviewing! I really like your ideas to improve this. I'd suggest merging this version for now and when we have some time or someone from the community can help out we can implement the improvements.

Copy link
Member

@astrojuanlu astrojuanlu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested this with:

  1. Installing
$ uv pip install "kedro-datasets[langchain-chatopenaidataset] @ git+https://github.com/ianwhale/kedro-plugins@feat/langchain-dataset#subdirectory=kedro-datasets"
  1. Create a Kedro project with 1 pipeline
  2. Add this node
from langchain_openai.chat_models import ChatOpenAI
from langchain_core.messages.ai import AIMessage


def do_stuff(llm: ChatOpenAI) -> str:
    messages = [
        ("system", "Say 'hello world' in 5 different languages.")
    ]
    result: AIMessage = llm.invoke(messages)
    return result.content
  1. This catalog.yml
gpt_3_5_turbo:
   type: kedro_datasets_experimental.langchain.ChatOpenAIDataset
   kwargs:
     model: "gpt-3.5-turbo"
     temperature: 0.0
   credentials: openai


text_dataset:
  type: text.TextDataset
  filepath: data/output.txt

and credentials.yml:

openai:
  openai_api_base: ""
  openai_api_key: sk-test-kedro-langchain-...

And it worked 👏🏼

There's one thing that I think it's worth considering if we ever promote this dataset to official. Usually we tell users that datasets do all the I/O so that nodes don't have to. However, passing an object that's essentially a wrapper around a REST API essentially breaks that rule. Lastly, the fact that these APIs aren't versioned and therefore might return completely different things at different times is a threat to Kedro promise of reproducibility.

On the flip side, we already support our Hugging Face datasets, and I see how a HF model that spits text and an OpenAI wrapper that spits text have the same interface, but work in completely different ways, with I/O happening in different moments. I don't have a good solution for this - it somehow bends Kedro conventions and forces us to think what is the responsibility of the dataset and what should be written in the node.

I'm approving anyway - thanks @ianwhale for your patience and congratulations for contributing the first experimental dataset! 👏🏼

@astrojuanlu astrojuanlu merged commit 7f3f3ec into kedro-org:main Jun 3, 2024
14 checks passed
@astrojuanlu
Copy link
Member

@merelcht
Copy link
Member

merelcht commented Jun 3, 2024

I was still planning on polishing before merging, but then it was already merged. Maybe let the assignee/author complete it next time instead of merging as reviewer?

@astrojuanlu
Copy link
Member

Maybe let the assignee/author complete it next time instead of merging as reviewer?

👍🏼

tgoelles pushed a commit to tgoelles/kedro-plugins that referenced this pull request Jun 6, 2024
…, and OpenAI models (kedro-org#434)

* Add openai datasets.

* Add anthropic and cohere

Signed-off-by: Ian Whalen <ianpatrickwhalen@gmail.com>

* Add python API examples to docstrings.

Signed-off-by: Ian Whalen <ianpatrickwhalen@gmail.com>

* Clean up python example.

Signed-off-by: Ian Whalen <ianpatrickwhalen@gmail.com>

* Remove setup.py and move lanchain reqs to pyproject.toml

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>

* Move lanchain datasets to experimental

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>

* Try get antrophic dataset running. Looks like API URL is not necessary?

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>

* Update cohere package and imports

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>

* Update openai dependency + allow for url in antrophic

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>

* Improve Cohere dataset

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>

* Make credentials consistent + fix openai examples

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>

* Turn cohere dataset into chatcohere dataset

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>

* Clean up cohere dataset

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>

* Update release notes + init

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>

* Apply suggestions from code review

Co-authored-by: ElenaKhaustova <157851531+ElenaKhaustova@users.noreply.github.com>
Signed-off-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>

* Add version pins for langchain dependencies

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>

* Update kedro-datasets/kedro_datasets_experimental/langchain/_anthropic.py

Co-authored-by: ElenaKhaustova <157851531+ElenaKhaustova@users.noreply.github.com>
Signed-off-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>

* Try loosen pin on langchain-cohere

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>

* Only pin dependencies of dataset def in pyproject.toml

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>

---------

Signed-off-by: Ian Whalen <ianpatrickwhalen@gmail.com>
Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
Signed-off-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>
Co-authored-by: Merel Theisen <merel.theisen@quantumblack.com>
Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>
Co-authored-by: ElenaKhaustova <157851531+ElenaKhaustova@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Community Issue/PR opened by the open-source community
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants