feat(datasets): Add limited langchain support for Anthropic, Cohere, and OpenAI models #434
Conversation
Thanks a lot for contributing this @ianwhale, I think it would be amazing to have these in! Have you used them already in any proof of concept? I see you have written YAML examples in the docstrings but we don't have much experience with langchain, so it would help us to have some Python examples as well.
Apart from that, what's missing from your side to declare this ready for a code review?
Hey again @astrojuanlu! Excuse my slow reply, I was out for Thanksgiving. I did use this in a PoC. However, I only ever used the YAML API. I'll push some Python API examples.
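For orientation, a minimal sketch of what such a Python API example could look like, assuming the dataset takes the same kwargs/credentials arguments as in the YAML examples further down and ships under kedro_datasets_experimental; the exact constructor signature in the PR may differ:

# Hypothetical usage sketch; constructor arguments mirror the YAML catalog
# entries shown later in this PR and are not guaranteed to match the final API.
from kedro_datasets_experimental.langchain import ChatOpenAIDataset

dataset = ChatOpenAIDataset(
    kwargs={"model": "gpt-3.5-turbo", "temperature": 0.0},
    credentials={"openai_api_key": "sk-..."},
)
llm = dataset.load()  # returns a configured langchain ChatOpenAI client
print(llm.invoke("Say hello in five languages.").content)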
Depending on #629, this may need to move to a contribution folder, but I think this is mostly ready.
Hi @ianwhale, thanks so much for your patience with this PR! We're about to launch our new experimental dataset contribution model, which basically means you can contribute datasets that are more experimental and don't have to have full test coverage etc., here: https://github.com/kedro-org/kedro-plugins/tree/main/kedro-datasets/kedro_datasets_experimental. I think the datasets in this PR would be a perfect first candidate to go in there.
Thank you, @merelcht and @ianwhale, nice work! 🚀
Approving the PR with some minor comments. I've tested all datasets on my end and they work as expected.
A couple of thoughts related to the topic, which we could consider in the future:
- When we worked with langchain we found it convenient to work with chains that combine an llm and a prompt and provide a standardised interface to call the model with run-time parameters (prompt placeholders), so one can use different LLMs with the same interface (see the chain sketch after the YAML example below). Example using the latest langchain API: https://python.langchain.com/v0.1/docs/integrations/chat/anthropic/
- We also ended up doing dynamic model initialisation; in our case it helps users switch between different models (OpenAI, Cohere, Azure, etc.) without needing to add extra datasets, using just one LangchainDataset (a rough sketch follows the YAML example below). For example, the catalog.yaml could look like this:
gpt_3_5_turbo:
  type: langchain.Dataset
  model_type: langchain_openai.ChatOpenAI
  kwargs:
    model: "gpt-3.5-turbo"
    temperature: 0.0
  credentials: openai

claude_instant_1:
  type: langchain.Dataset
  model_type: langchain_anthropic.ChatAnthropic
  kwargs:
    model: "claude-instant-1"
    temperature: 0.0
  credentials: anthropic
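A rough sketch of how such a single dynamic dataset could be implemented — purely hypothetical, the class and argument names are illustrative and not part of this PR:

# Hypothetical dynamic dataset: imports the class named in `model_type` at load
# time, so one dataset class can serve OpenAI, Anthropic, Cohere, Azure, etc.
from importlib import import_module
from typing import Any, Dict

from kedro.io import AbstractDataset


class LangchainDataset(AbstractDataset):
    def __init__(self, model_type: str, credentials: Dict[str, str], kwargs: Dict[str, Any] = None):
        self._module_name, self._class_name = model_type.rsplit(".", 1)
        self._credentials = credentials
        self._kwargs = kwargs or {}

    def _load(self) -> Any:
        # e.g. model_type: langchain_openai.ChatOpenAI
        model_cls = getattr(import_module(self._module_name), self._class_name)
        return model_cls(**self._credentials, **self._kwargs)

    def _save(self, data: Any) -> None:
        raise NotImplementedError("LangchainDataset is read-only")

    def _describe(self) -> Dict[str, Any]:
        # deliberately excludes credentials so API keys never end up in logs
        return {"model_type": f"{self._module_name}.{self._class_name}", "kwargs": self._kwargs}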
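And on the chains point, a minimal sketch of the pattern from the first bullet: the prompt and the model are composed once, and callers only supply run-time placeholder values; swapping ChatOpenAI for ChatAnthropic or ChatCohere leaves the calling code unchanged (model names here are just examples):

# Minimal chain sketch: combine a prompt template with an llm behind one interface.
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_messages(
    [("system", "You are a concise assistant."), ("human", "{question}")]
)
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0.0)  # could equally be ChatAnthropic, ChatCohere, ...
chain = prompt | llm  # standardised interface: run-time parameters fill the prompt placeholders

answer = chain.invoke({"question": "Say 'hello world' in 5 different languages."})
print(answer.content)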
Thanks @ElenaKhaustova for reviewing! I really like your ideas to improve this. I'd suggest merging this version for now, and when we have some time, or someone from the community can help out, we can implement the improvements.
Tested this with:
- Installing the branch:
  $ uv pip install "kedro-datasets[langchain-chatopenaidataset] @ git+https://github.com/ianwhale/kedro-plugins@feat/langchain-dataset#subdirectory=kedro-datasets"
- Creating a Kedro project with one pipeline (wiring sketched after these steps)
- Adding this node:
from langchain_openai.chat_models import ChatOpenAI
from langchain_core.messages.ai import AIMessage


def do_stuff(llm: ChatOpenAI) -> str:
    messages = [
        ("system", "Say 'hello world' in 5 different languages.")
    ]
    result: AIMessage = llm.invoke(messages)
    return result.content
- This catalog.yml:
gpt_3_5_turbo:
  type: kedro_datasets_experimental.langchain.ChatOpenAIDataset
  kwargs:
    model: "gpt-3.5-turbo"
    temperature: 0.0
  credentials: openai
text_dataset:
  type: text.TextDataset
  filepath: data/output.txt
- And this credentials.yml:
openai:
  openai_api_base: ""
  openai_api_key: sk-test-kedro-langchain-...
And it worked 👏🏼
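For completeness, the pipeline wiring used in a test like this would look roughly as follows (node and module names are illustrative, not taken from the PR):

# Hypothetical pipeline registration connecting the node above to the catalog
# entries gpt_3_5_turbo (input) and text_dataset (output).
from kedro.pipeline import Pipeline, node, pipeline

from .nodes import do_stuff  # the node defined above


def create_pipeline(**kwargs) -> Pipeline:
    return pipeline(
        [
            node(
                func=do_stuff,
                inputs="gpt_3_5_turbo",   # ChatOpenAIDataset entry in catalog.yml
                outputs="text_dataset",   # TextDataset entry, written to data/output.txt
                name="say_hello_node",
            ),
        ]
    )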
There's one thing that I think is worth considering if we ever promote this dataset to official. Usually we tell users that datasets do all the I/O so that nodes don't have to. However, passing an object that's essentially a wrapper around a REST API breaks that rule. Lastly, the fact that these APIs aren't versioned, and therefore might return completely different things at different times, is a threat to Kedro's promise of reproducibility.
On the flip side, we already support Hugging Face datasets, and I see how an HF model that emits text and an OpenAI wrapper that emits text have the same interface but work in completely different ways, with I/O happening at different moments. I don't have a good solution for this; it somehow bends Kedro conventions and forces us to think about what the responsibility of the dataset is and what should be written in the node.
I'm approving anyway - thanks @ianwhale for your patience and congratulations for contributing the first experimental dataset! 👏🏼
Docs don't show up though 😬 https://docs.kedro.org/projects/kedro-datasets/en/latest/api/kedro_datasets_experimental.html
I was still planning on polishing before merging, but then it was already merged. Maybe let the assignee/author complete it next time instead of merging as the reviewer?
👍🏼
Merged as: feat(datasets): Add limited langchain support for Anthropic, Cohere, and OpenAI models (kedro-org#434). Squashed commits:
- Add openai datasets
- Add anthropic and cohere
- Add python API examples to docstrings
- Clean up python example
- Remove setup.py and move langchain reqs to pyproject.toml
- Move langchain datasets to experimental
- Try to get anthropic dataset running; looks like the API URL is not necessary?
- Update cohere package and imports
- Update openai dependency + allow for url in anthropic
- Improve Cohere dataset
- Make credentials consistent + fix openai examples
- Turn cohere dataset into chatcohere dataset
- Clean up cohere dataset
- Update release notes + init
- Apply suggestions from code review
- Add version pins for langchain dependencies
- Update kedro-datasets/kedro_datasets_experimental/langchain/_anthropic.py
- Try loosen pin on langchain-cohere
- Only pin dependencies of dataset def in pyproject.toml
Description
Adds limited support for langchain models. This PR is a rough starting point for loading langchain API-based models.
The big issue here is langchain's model catalog. See the list here (just for chat models).
There's no way anyone could implement and maintain all of these.
Even if that were desirable, we can see from the CohereDataset example that there are going to be lots of details along the way that will make this task difficult. Would love to hear what the team thinks and whether this is worth pushing forward!
Development notes
Adds four datasets for interacting with langchain models.
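For illustration, the general shape of one of these datasets — a read-only wrapper that hands a configured client to nodes — is sketched below; this is an approximation, not the exact code in the PR, and the constructor arguments are assumed from the YAML examples above:

# Approximate shape of a ChatOpenAIDataset: load() returns a configured client,
# save() is unsupported. Details may differ from the actual implementation.
from typing import Any, Dict

from kedro.io import AbstractDataset
from langchain_openai import ChatOpenAI


class ChatOpenAIDataset(AbstractDataset):
    def __init__(self, credentials: Dict[str, str], kwargs: Dict[str, Any] = None):
        # credentials typically carry openai_api_key / openai_api_base,
        # kwargs carry model name, temperature, etc.
        self._credentials = credentials
        self._kwargs = kwargs or {}

    def _load(self) -> ChatOpenAI:
        return ChatOpenAI(**self._credentials, **self._kwargs)

    def _save(self, data: Any) -> None:
        raise NotImplementedError("ChatOpenAIDataset does not support saving")

    def _describe(self) -> Dict[str, Any]:
        # credentials intentionally omitted so keys don't leak into logs
        return {"kwargs": self._kwargs}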
Notes from @merelcht after updating the datasets:
- Turned the CohereDataset into a ChatCohereDataset, because I kept getting an error about the number of arguments being passed. I ran into the same issue even when just running the examples in https://python.langchain.com/v0.1/docs/integrations/llms/cohere/

Checklist
- Added a description of this change in the relevant RELEASE.md file