Implement chunking transformer for Weaviate destination #576

Open
5 tasks
burnash opened this issue Aug 22, 2023 · 0 comments


burnash commented Aug 22, 2023

Background

#532 introduced support for the Weaviate vector database to dlt. That support lets users include specific fields in a vector index and lets Weaviate generate embeddings for the data, but it has a limitation when dealing with large content: oversized data must be chunked before it is submitted to Weaviate for processing.

Objective

To provide a more seamless integration with Weaviate, we need to add a transformer that can chunk the data into manageable sizes. This transformer should be flexible, allowing users to define the chunking strategy based on specific heuristics.

Tasks

  • Develop a transformer function that accepts input data and returns it in chunked form.
    • The transformer should have an interface that allows it to accept a custom function, which will define the chunking strategy.
    • Integrate functionality similar to the text splitters from LangChain, which provide heuristic-based content splitting.
  • Include sample heuristics or functions that developers can use or customize for their chunking needs.
  • Update the docs to explain how to use the chunking transformer.
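
A minimal sketch of what the tasks above could look like, in plain Python with no dlt or Weaviate dependency. All names here (`chunk_items`, `split_by_chars`, `split_paragraphs`, the `chunk_index` field) are hypothetical and only illustrate the "transformer that accepts a custom chunking function" idea; they are not an existing dlt API.

```python
import re
from typing import Any, Callable, Dict, Iterable, Iterator, List

# A chunking strategy is any callable that maps a text to a list of chunks.
Chunker = Callable[[str], List[str]]


def split_by_chars(max_chars: int = 200, overlap: int = 0) -> Chunker:
    """Sample heuristic: fixed-size character windows with optional overlap."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be in [0, max_chars)")
    step = max_chars - overlap

    def chunker(text: str) -> List[str]:
        chunks: List[str] = []
        start = 0
        while start < len(text):
            chunks.append(text[start:start + max_chars])
            # Stop once the current window reaches the end of the text,
            # so the tail is not repeated as a redundant extra chunk.
            if start + max_chars >= len(text):
                break
            start += step
        return chunks

    return chunker


def split_paragraphs(text: str) -> List[str]:
    """Sample heuristic: split on blank lines and drop empty paragraphs."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def chunk_items(
    items: Iterable[Dict[str, Any]],
    field: str,
    chunker: Chunker,
) -> Iterator[Dict[str, Any]]:
    """Yield one copy of each item per chunk of item[field].

    Items whose field is missing, empty, or not a string pass through unchanged.
    """
    for item in items:
        text = item.get(field)
        if not isinstance(text, str) or not text:
            yield item
            continue
        for index, chunk in enumerate(chunker(text)):
            # Hypothetical extra field so chunks can be put back in order.
            yield {**item, field: chunk, "chunk_index": index}
```

Usage would be along the lines of `chunk_items(rows, "body", split_by_chars(max_chars=500, overlap=50))` or `chunk_items(rows, "body", split_paragraphs)`. Wiring this generator into a dlt pipeline ahead of the Weaviate destination, and choosing sensible default sizes, is left open here.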

Tests

  • Unit tests for the transformer
  • Integration tests with the Weaviate destination