feat(datasets): Add limited `langchain` support for Anthropic, Cohere, and OpenAI models #434

ianwhale · 2023-11-16T22:00:27Z

Description

Adds limited support for langchain models.

This PR is a rough starting point for loading langchain API-based models.

The big issue here is langchain's model catalog. See the list here (just for chat models).

There's no way anyone could implement and maintain all of these.

Even if that were desirable, the CohereDataset example shows that there are going to be lots of details along the way that will make this task difficult.

Would love to see what the team thinks and if this is worth pushing forward!

Development notes

Adds four datasets for interacting with langchain models.

Notes from @merelcht after updating the datasets

Updated the imports and dependencies to the latest versions
I converted the CohereDataset to a ChatCohereDataset because I kept getting an error about the number of arguments being passed. I ran into the same issue even when just running the examples in https://python.langchain.com/v0.1/docs/integrations/llms/cohere/
Tested all datasets with the examples provided in the docstrings

Checklist

Opened this PR as a 'Draft Pull Request' if it is work-in-progress
Updated the documentation to reflect the code changes
Added a description of this change in the relevant RELEASE.md file
Added tests to cover my changes

astrojuanlu

Thanks a lot for contributing this @ianwhale, I think it would be amazing to have these in! Have you used them already in any proof of concept? I see you have written YAML examples in the docstrings but we don't have much experience with langchain, so it would help us to have some Python examples as well.

Apart from that, what's missing from your side to declare this ready for a code review?

ianwhale · 2023-11-27T15:45:39Z

Hey again @astrojuanlu! Excuse my slow reply, I was out for Thanksgiving.

I did use this in a PoC. However, I only ever used the YAML API.

I'll push some Python API examples.

noklam · 2024-04-30T19:54:05Z

Depending on #629, this may need to move to a contribution folder, but I think this is mostly ready.

merelcht · 2024-05-15T16:20:20Z

Hi @ianwhale, thanks so much for your patience with this PR! We're about to launch our new experimental dataset contribution model, which basically means you can contribute datasets that are more experimental and don't need full test coverage, etc. They live here: https://github.com/kedro-org/kedro-plugins/tree/main/kedro-datasets/kedro_datasets_experimental

I think this PR with datasets would be a perfect first candidate to go into kedro_datasets_experimental. I don't think there's much else you need to do, other than move it to that directory.

ElenaKhaustova

Thank you, @merelcht and @ianwhale, nice work! 🚀

Approving the PR with some minor comments. I've tested all datasets on my end and they work as expected.

A couple of thoughts related to the topic that we can consider in the future:

When we worked with langchain, we found it convenient to work with chains, which combine an LLM and a prompt and provide a standardised interface for calling the model with run-time parameters (prompt placeholders). That way, one can use different LLMs through the same interface. Example using the latest langchain API: https://python.langchain.com/v0.1/docs/integrations/chat/anthropic/
We also arrived at dynamic model initialisation. In our case, it lets users switch between different models (OpenAI, Cohere, Azure, etc.) with just one LangchainDataset, without adding extra datasets. For example, the catalog.yaml could look like this:

gpt_3_5_turbo:
   type: langchain.Dataset
   model_type: langchain_openai.ChatOpenAI
   kwargs:
     model: "gpt-3.5-turbo"
     temperature: 0.0
   credentials: openai

claude_instant_1:
   type: langchain.Dataset
   model_type: langchain_anthropic.ChatAnthropic
   kwargs:
     model: "claude-instant-1"
     temperature: 0.0
   credentials: anthropic
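
As a rough illustration of the dynamic initialisation idea, a `model_type` string like the ones in the catalog above could be resolved to a class at load time. This is only a sketch, not the merged implementation: `load_class` and `build_model` are hypothetical helper names, and merging `credentials` into the constructor keyword arguments is an assumption. A standard-library class stands in for the model so the snippet runs without langchain installed.

```python
from importlib import import_module
from typing import Any, Dict, Optional


def load_class(dotted_path: str) -> type:
    """Resolve a 'package.module.ClassName' string to the class object."""
    module_path, _, class_name = dotted_path.rpartition(".")
    if not module_path:
        raise ValueError(f"Expected 'module.ClassName', got {dotted_path!r}")
    return getattr(import_module(module_path), class_name)


def build_model(
    model_type: str,
    kwargs: Optional[Dict[str, Any]] = None,
    credentials: Optional[Dict[str, Any]] = None,
) -> Any:
    """Instantiate the class named by model_type.

    Assumption: credentials are passed as extra keyword arguments.
    """
    settings = {**(kwargs or {}), **(credentials or {})}
    return load_class(model_type)(**settings)


# Stand-in class so the example runs offline; with langchain installed,
# model_type could be "langchain_openai.ChatOpenAI" as in the catalog above.
delta = build_model(
    "datetime.timedelta",
    kwargs={"hours": 1},
    credentials={"minutes": 30},
)
print(delta)  # 1:30:00
```

With this approach a single dataset class could serve every provider, at the cost of deferring import errors and argument validation to load time.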

merelcht · 2024-06-03T09:35:53Z

A couple of thoughts related to the topic that we can consider in the future:

When we worked with langchain, we found it convenient to work with chains, which combine an LLM and a prompt and provide a standardised interface for calling the model with run-time parameters (prompt placeholders). That way, one can use different LLMs through the same interface. Example using the latest langchain API: https://python.langchain.com/v0.1/docs/integrations/chat/anthropic/

We also arrived at dynamic model initialisation. In our case, it lets users switch between different models (OpenAI, Cohere, Azure, etc.) with just one LangchainDataset, without adding extra datasets.

Thanks @ElenaKhaustova for reviewing! I really like your ideas to improve this. I'd suggest merging this version for now and when we have some time or someone from the community can help out we can implement the improvements.
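
For reference, the chain idea quoted above can be sketched without any dependencies. The snippet below is a library-free illustration, not langchain's API: `Chain` and `EchoModel` are hypothetical names, and the only assumption about a model is that it exposes an `invoke` method. It shows one prompt template with a run-time placeholder being reused across two interchangeable models.

```python
from dataclasses import dataclass
from typing import Protocol


class Model(Protocol):
    """Anything with an invoke method; a stand-in for a chat model."""

    def invoke(self, prompt: str) -> str: ...


@dataclass
class EchoModel:
    """Hypothetical offline model that tags the prompt with its name."""

    name: str

    def invoke(self, prompt: str) -> str:
        return f"[{self.name}] {prompt}"


@dataclass
class Chain:
    """Pairs a prompt template with a model behind one call signature."""

    template: str
    model: Model

    def invoke(self, **placeholders: str) -> str:
        # Fill the run-time placeholders, then delegate to the model.
        return self.model.invoke(self.template.format(**placeholders))


template = "Say 'hello world' in {language}."
chain_a = Chain(template, EchoModel("model-a"))
chain_b = Chain(template, EchoModel("model-b"))

# Same call, different underlying model.
print(chain_a.invoke(language="French"))
print(chain_b.invoke(language="French"))
```

The calling code never changes when the model is swapped, which is the property that makes a single chain-style dataset attractive.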

astrojuanlu

Tested this with:

Installing

$ uv pip install "kedro-datasets[langchain-chatopenaidataset] @ git+https://github.com/ianwhale/kedro-plugins@feat/langchain-dataset#subdirectory=kedro-datasets"

Create a Kedro project with 1 pipeline
Add this node

from langchain_openai.chat_models import ChatOpenAI
from langchain_core.messages.ai import AIMessage


def do_stuff(llm: ChatOpenAI) -> str:
    messages = [
        ("system", "Say 'hello world' in 5 different languages.")
    ]
    result: AIMessage = llm.invoke(messages)
    return result.content

This catalog.yml

gpt_3_5_turbo:
   type: kedro_datasets_experimental.langchain.ChatOpenAIDataset
   kwargs:
     model: "gpt-3.5-turbo"
     temperature: 0.0
   credentials: openai


text_dataset:
  type: text.TextDataset
  filepath: data/output.txt

and credentials.yml:

openai:
  openai_api_base: ""
  openai_api_key: sk-test-kedro-langchain-...

And it worked 👏🏼
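
Since the node only calls `llm.invoke(...)` and reads `.content`, it can also be exercised offline with a stub, without an API key or network access. The sketch below assumes nothing about langchain beyond that call pattern: `StubLLM` and `StubMessage` are hypothetical stand-ins for `ChatOpenAI` and `AIMessage`, and the node body is repeated without the langchain type hints so the snippet is self-contained.

```python
from dataclasses import dataclass


@dataclass
class StubMessage:
    """Hypothetical stand-in for AIMessage; only content is used."""

    content: str


class StubLLM:
    """Hypothetical stand-in for ChatOpenAI that records its inputs."""

    def __init__(self, reply: str) -> None:
        self.reply = reply
        self.calls = []

    def invoke(self, messages):
        # Record the messages so a test can inspect what the node sent.
        self.calls.append(messages)
        return StubMessage(self.reply)


def do_stuff(llm) -> str:
    # Same logic as the node above, minus the langchain type hints.
    messages = [("system", "Say 'hello world' in 5 different languages.")]
    return llm.invoke(messages).content


llm = StubLLM("hello world / hola mundo")
output = do_stuff(llm)
print(output)
```

A stub like this keeps the node's own logic testable and deterministic even though the real model's responses are not.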

There's one thing that I think is worth considering if we ever promote this dataset to official. Usually we tell users that datasets do all the I/O so that nodes don't have to. However, passing an object that's essentially a wrapper around a REST API breaks that rule. In addition, the fact that these APIs aren't versioned, and therefore might return completely different things at different times, is a threat to Kedro's promise of reproducibility.

On the flip side, we already support Hugging Face datasets, and I see how an HF model that spits out text and an OpenAI wrapper that spits out text have the same interface but work in completely different ways, with I/O happening at different moments. I don't have a good solution for this: it somewhat bends Kedro conventions and forces us to think about what is the responsibility of the dataset and what should be written in the node.

I'm approving anyway - thanks @ianwhale for your patience and congratulations for contributing the first experimental dataset! 👏🏼

astrojuanlu · 2024-06-03T11:32:41Z

Docs don't show up though 😬 https://docs.kedro.org/projects/kedro-datasets/en/latest/api/kedro_datasets_experimental.html

merelcht · 2024-06-03T12:31:42Z

I was still planning on polishing before merging, but then it was already merged. Maybe let the assignee/author complete it next time instead of merging as reviewer?

astrojuanlu · 2024-06-03T13:00:03Z

Maybe let the assignee/author complete it next time instead of merging as reviewer?

👍🏼

ianwhale added 2 commits November 16, 2023 15:23

Add openai datasets.

8b6c34c

Add anthropic and cohere

8a3bdfb

Signed-off-by: Ian Whalen <ianpatrickwhalen@gmail.com>

ianwhale changed the title ~~Add limited langchain support for Anthropic, Cohere, and OpenAI models~~ feat(datasets): Add limited langchain support for Anthropic, Cohere, and OpenAI models Nov 16, 2023

astrojuanlu reviewed Nov 20, 2023

Add python API examples to docstrings.

a3de75d

Signed-off-by: Ian Whalen <ianpatrickwhalen@gmail.com>

AhdraMeraliQB added the Community Issue/PR opened by the open-source community label Dec 7, 2023

astrojuanlu mentioned this pull request Dec 18, 2023

Investigate the use of a chatbot to form a knowledgebase kedro-org/kedro#2026

Open

merelcht mentioned this pull request Jan 16, 2024

Proposal for Adding Contributions Space for Experimental Datasets #517

Closed

merelcht mentioned this pull request Apr 4, 2024

Experimental dataset contribution model #629

Closed

Clean up python example.

39dd5ae

Signed-off-by: Ian Whalen <ianpatrickwhalen@gmail.com>

merelcht self-assigned this May 14, 2024

merelcht and others added 13 commits May 21, 2024 14:59

Merge branch 'main' into feat/langchain-dataset

b38786f

# Conflicts: # kedro-datasets/setup.py

Remove setup.py and move lanchain reqs to pyproject.toml

72cf548

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>

Move lanchain datasets to experimental

de2596b

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>

Try get antrophic dataset running. Looks like API URL is not necessary?

b67c43f

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>

Update cohere package and imports

0fab1f6

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>

Merge branch 'main' into feat/langchain-dataset

5e059f3

Signed-off-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>

Update openai dependency + allow for url in antrophic

6d9ba95

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>

Merge branch 'feat/langchain-dataset' of https://github.com/ianwhale/…

4d60267

…kedro-plugins into feat/langchain-dataset

Improve Cohere dataset

05d8573

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>

Make credentials consistent + fix openai examples

82de16a

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>

Turn cohere dataset into chatcohere dataset

f865aa6

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>

Clean up cohere dataset

2d805d9

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>

Merge branch 'main' into feat/langchain-dataset

56ce7ff

merelcht marked this pull request as ready for review May 30, 2024 09:56

merelcht requested a review from ElenaKhaustova May 30, 2024 09:58

Update release notes + init

4d726b1

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>

merelcht requested a review from astrojuanlu May 30, 2024 12:21

ElenaKhaustova approved these changes May 31, 2024

merelcht and others added 5 commits June 3, 2024 10:12

Apply suggestions from code review

89c49a1

Co-authored-by: ElenaKhaustova <157851531+ElenaKhaustova@users.noreply.github.com> Signed-off-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>

Add version pins for langchain dependencies

0d88147

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>

Update kedro-datasets/kedro_datasets_experimental/langchain/_anthropi…

8b2d578

…c.py Co-authored-by: ElenaKhaustova <157851531+ElenaKhaustova@users.noreply.github.com> Signed-off-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>

Try loosen pin on langchain-cohere

acec1d5

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>

Only pin dependencies of dataset def in pyproject.toml

dac5066

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>

astrojuanlu approved these changes Jun 3, 2024

astrojuanlu merged commit 7f3f3ec into kedro-org:main Jun 3, 2024
14 checks passed
