Airbyte based loaders #8586

flash1293 · 2023-08-01T17:15:12Z

This PR adds 8 new loaders:

AirbyteCDKLoader: this reader can wrap and run all Python-based Airbyte source connectors.
Separate loaders for the most commonly used APIs:
- AirbyteGongLoader
- AirbyteHubspotLoader
- AirbyteSalesforceLoader
- AirbyteShopifyLoader
- AirbyteStripeLoader
- AirbyteTypeformLoader
- AirbyteZendeskSupportLoader

Documentation and getting started

I added the basic shape of the config to the notebooks. This increases the maintenance effort a bit, but I think it's worth it so that people can get started quickly with these important connectors. This is also why I linked the spec and the documentation page in the readme: together, these two contain all the information needed to configure a source correctly (e.g. it won't suggest using OAuth if that's avoidable, even if the connector supports it).

Document generation

The "documents" produced by these loaders won't have a text part (instead, all the record fields are put into the metadata). If the use case requires text, the caller needs to apply a custom transformation that builds it from the record fields.
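A minimal sketch of such a transformation, assuming documents arrive with empty text and all fields in the metadata. The `Document` dataclass here is a stand-in so the snippet runs on its own, and `with_text` is a hypothetical helper, not part of this PR:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Document:
    """Stand-in for langchain's Document so the sketch is self-contained."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def with_text(docs: List[Document], text_field: str) -> List[Document]:
    """Hypothetical helper: copy one metadata field into page_content."""
    return [
        Document(
            page_content=str(doc.metadata.get(text_field, "")),
            metadata=doc.metadata,
        )
        for doc in docs
    ]


# Shape of what the loaders emit: no text, every record field in metadata.
loaded = [Document(page_content="", metadata={"id": 1, "title": "Refund request"})]
converted = with_text(loaded, text_field="title")
```

Which field (or combination of fields) makes a good text depends entirely on the stream, which is presumably why the loaders do not pick one by default.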

Incremental sync

All loaders support incremental syncs if the underlying streams support them. If you store the last_state from the reader instance and pass it in on the next load, only updated records are loaded.
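The pattern can be illustrated with a small stand-in. `FakeIncrementalLoader` and its integer cursor are hypothetical and only mimic the `state` / `last_state` flow described above; real connectors define their own state shape:

```python
from typing import Any, Dict, Iterator, List, Optional, Tuple

CursorRecord = Tuple[int, Dict[str, Any]]


class FakeIncrementalLoader:
    """Hypothetical loader that mimics the state / last_state pattern."""

    def __init__(
        self,
        records: List[CursorRecord],
        state: Optional[Dict[str, Any]] = None,
    ) -> None:
        self._records = records
        self._state = state
        self.last_state: Optional[Dict[str, Any]] = state

    def lazy_load(self) -> Iterator[Dict[str, Any]]:
        # Skip everything at or before the cursor saved by a previous run.
        cursor = (self._state or {}).get("cursor", 0)
        for position, record in self._records:
            if position > cursor:
                self.last_state = {"cursor": position}
                yield record

    def load(self) -> List[Dict[str, Any]]:
        return list(self.lazy_load())


upstream = [(1, {"id": "a"}), (2, {"id": "b"})]
first = FakeIncrementalLoader(upstream)
first_batch = first.load()
saved_state = first.last_state  # persist this between runs

upstream.append((3, {"id": "c"}))  # a new record shows up later
second = FakeIncrementalLoader(upstream, state=saved_state)
second_batch = second.load()
```

The second run only yields the record added after the saved cursor, which is the behaviour the description promises for streams that support incremental sync.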

baskaryan · 2023-08-03T06:59:37Z

@flash1293 feel free to ping when this is ready for review

flash1293 · 2023-08-03T15:04:09Z

@baskaryan Should be ready for a look now

pedroslopez

Made some comments on some potential UX/ergonomic improvements, but generally looks good!

Thanks Joe 😄

pedroslopez · 2023-08-03T19:11:30Z

docs/extras/integrations/document_loaders/airbyte_cdk.ipynb

+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#!pip install \"source_github@git+https://github.com/airbytehq/airbyte.git@master#subdirectory=airbyte-integrations/connectors/source-github\""


Should these be updated to reference the PyPI connectors?

No, for those they should use the other loaders, this is basically the "escape hatch"

pedroslopez · 2023-08-03T19:16:53Z

libs/langchain/langchain/document_loaders/airbyte_cdk.py

+        return list(self._load_data(stream_name=self._stream_name, state=self._state))
+
+    def lazy_load(self) -> Iterator[Document]:
+        return self._load_data(stream_name=self._stream_name, state=self._state)


I notice we're passing state through... should the loader also internally update _state with the latest data, or do we want to make it explicit for users to pass through state themselves?

Just thinking about the ergonomics, someone that wants to implement incremental would need to instantiate a new instance of the loader if state changed. Not sure if instead we should pass state in the load method rather than on init.

Thinking about this from the Airbyte PoV, state changes on each run of the source, but the other info like configs/catalog stays the same throughout, which is why I'm leaning a bit towards moving state to the load.

Following the slack convo, if this is part of the langchain interface please ignore me 😛

pedroslopez · 2023-08-03T19:18:52Z

libs/langchain/langchain/document_loaders/airbyte_cdk.py

+from airbyte_protocol.models.airbyte_protocol import AirbyteRecordMessage, AirbyteStateMessage
+from airbyte_cdk.sources.embedded.base_integration import BaseEmbeddedIntegration
+from airbyte_cdk.sources.embedded.runner import CDKRunner


Where are the airbyte_protocol and airbyte_cdk dependencies defined?

Do we expect users to install them separately?

One is a transitive dependency of the other, but airbyte-cdk is actually re-exporting the types, so we can work with just airbyte_protocol here.

pedroslopez · 2023-08-03T19:22:43Z

libs/langchain/langchain/document_loaders/airbyte_cdk.py

+        super().__init__(config=config, runner=CDKRunner(source=source_class(), name=source_class.__name__))
+        self._stream_name = stream_name
+        self._state = state


This might be something on the CDK runner side, but do we want to do any sort of validation on the config? I think if we're not doing so already, a quick win on UX could be to run check with the config on init

I agree that running check makes sense, I will add that in a separate PR on the CDK side

pedroslopez · 2023-08-03T19:28:34Z

docs/extras/integrations/document_loaders/airbyte_zendesk_support.ipynb

+    "    def _handle_record(self, record, id):\n",
+    "        return Document(page_content=record.data[\"title\"], metadata=record.data)\n",


This might be an ok place to start, but I wonder if we can make this easier when creating the loader so don't have to extend the whole class.

I could see this as either a method passed on init, or exposing things like document_text_field: "title" and document_metadata_fields: [] directly.

Changed to a parameter passed into init
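A rough sketch of what passing the handler on init could look like. `SketchLoader`, `Record`, and `Document` are stand-ins invented for this illustration; only the `record_handler` name and the `(record, id)` signature come from this thread:

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional


@dataclass
class Document:
    """Stand-in for langchain's Document."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Record:
    """Stand-in for an Airbyte record message; only .data is modelled."""

    data: Dict[str, Any]


RecordHandler = Callable[[Record, Optional[str]], Document]


def default_handler(record: Record, id: Optional[str]) -> Document:
    # Mirrors the default described in the PR: no text, fields in metadata.
    return Document(page_content="", metadata=record.data)


class SketchLoader:
    """Hypothetical loader that takes the handler as an init parameter."""

    def __init__(
        self,
        records: List[Record],
        record_handler: Optional[RecordHandler] = None,
    ) -> None:
        self._records = records
        self._record_handler = record_handler or default_handler

    def load(self) -> List[Document]:
        return [
            self._record_handler(record, str(index))
            for index, record in enumerate(self._records)
        ]


def title_as_text(record: Record, id: Optional[str]) -> Document:
    return Document(page_content=record.data["title"], metadata=record.data)


tickets = [Record(data={"title": "Printer on fire", "status": "open"})]
docs = SketchLoader(tickets, record_handler=title_as_text).load()
default_docs = SketchLoader(tickets).load()
```

Compared with subclassing, the caller only writes the one function that differs per use case.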

flash1293 · 2023-08-04T10:21:59Z

Thanks for the review @pedroslopez @eyurtsev @hwchase17 , made some adjustments:

- Consolidated the loaders into a single file (airbyte.cdk). I played with having a single "PredefinedSourceLoader" with a string argument, but I think having separate classes is easier to use as your IDE can give you a list of them (and the boilerplate isn't crazy)
- Avoid importing anything on the top level
- Allow passing in a record_handler to build the document instead of inheritance (also adjusted the notebooks)
- Use the import guard to import the individual source to give a nice error message if they are not available
- Added gong as it was requested (happy to add others from https://docs.airbyte.com/integrations/ )
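For readers unfamiliar with the import-guard idea, here is a minimal sketch of the technique. `guarded_import` is a hypothetical helper written for this illustration and is not claimed to match langchain's own utility:

```python
import importlib
from types import ModuleType
from typing import Optional


def guarded_import(module_name: str, pip_name: Optional[str] = None) -> ModuleType:
    """Import a module lazily and raise a helpful error if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Assumption: the pip package name is the module name with dashes.
        package = pip_name or module_name.split(".")[0].replace("_", "-")
        raise ImportError(
            f"Could not import {module_name}. "
            f"Install it with `pip install {package}`."
        ) from exc


# An installed module is returned as-is.
json_module = guarded_import("json")

# A missing module produces an actionable message instead of a bare error.
try:
    guarded_import("source_does_not_exist")
except ImportError as err:
    message = str(err)
```

Calling such a helper inside the loader's constructor, rather than at module level, keeps the top-level import of the loaders free of optional dependencies.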

Things I'm not sure about:

The record handler has an AirbyteRecordMessage argument, but how can I type that without importing? Not a blocker, but it would be nice to get rid of that Any
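One common way to approach this, sketched here as a suggestion rather than as what the PR ended up doing, is `typing.TYPE_CHECKING` combined with string annotations. The import path is the one quoted in the review above; `describe_record` and `_FakeRecord` are made up for the example:

```python
from typing import TYPE_CHECKING, Any, Callable, Dict, Optional

if TYPE_CHECKING:
    # Seen by type checkers only; never executed at runtime, so the
    # package does not need to be installed to import this module.
    from airbyte_protocol.models.airbyte_protocol import AirbyteRecordMessage

# String annotations defer resolution of the name to the type checker.
RecordHandler = Callable[["AirbyteRecordMessage", Optional[str]], Dict[str, Any]]


def describe_record(record: "AirbyteRecordMessage", id: Optional[str]) -> Dict[str, Any]:
    """Hypothetical handler that only relies on record.data."""
    return {"id": id, "fields": sorted(record.data)}


class _FakeRecord:
    """Runtime stand-in; a real run would receive an AirbyteRecordMessage."""

    data = {"title": "x", "status": "open"}


summary = describe_record(_FakeRecord(), "42")  # type: ignore[arg-type]
```

Type checkers and IDEs still see the real type when the package is installed, while users without it can import the module without errors.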

baskaryan · 2023-08-08T16:13:55Z

looks good from our end! @pedroslopez any additional thoughts?

baskaryan · 2023-08-08T21:49:23Z

thanks @flash1293!

homanp · 2023-08-09T19:36:33Z

@baskaryan

from langchain.document_loaders.airbyte import AirbyteStripeLoader

Throws:

from libs.langchain.langchain.utils.utils import guard_import
ModuleNotFoundError: No module named 'libs'

aaronsteers · 2023-08-09T20:15:29Z

@baskaryan

from langchain.document_loaders.airbyte import AirbyteStripeLoader

Throws:
from libs.langchain.langchain.utils.utils import guard_import
ModuleNotFoundError: No module named 'libs'

@homanp - Thanks for reporting. I am looking into this. 👀

eyurtsev · 2023-08-09T20:20:50Z

A fix has been merged on master @aaronsteers

eyurtsev · 2023-08-09T20:21:07Z

#8998

aaronsteers · 2023-08-09T20:21:13Z

Great - thanks, @eyurtsev ! 🎉

homanp · 2023-08-09T20:22:31Z

@eyurtsev when will you release the next version?

wip

1f23695

dosubot bot added the labels "Ɑ: doc loader" (Related to document loader module (not documentation)) and "🤖:enhancement" (A large net-new component, integration, or chain. Use sparingly. The largest features) Aug 1, 2023

add documentation and other connectors

9f7aa5a

flash1293 changed the title ~~[Draft] Airbyte loaders~~ Airbyte based loaders Aug 3, 2023

flash1293 marked this pull request as ready for review August 3, 2023 14:59

pedroslopez reviewed Aug 3, 2023

Joe Reuter added 4 commits August 4, 2023 11:05

Merge remote-tracking branch 'upstream/master' into flash1293/airbyte…

34dd9a0

…-loader

review comments

540e306

review comments

02de90f

add gong and fix guarded import

0acd142

vercel bot deployed to Preview – langchain August 4, 2023 10:31 View deployment

adjust readme

647e5c0

vercel bot deployed to Preview – langchain August 4, 2023 11:04 View deployment

lint

810f957

baskaryan merged commit 8f0cd91 into langchain-ai:master Aug 8, 2023
