
total bin coverage for default_transform() in Knowledge Graph transformations #1950


Open · wants to merge 1 commit into main

Conversation

tolgaerdonmez

Problem

default_transform() bins documents by token length over the 0-100k interval, separated into three bins.
But for documents whose token length falls outside these bins, e.g. longer than 100k tokens, this function raises the following:

    raise ValueError(
        "Documents appears to be too short (ie 100 tokens or less). Please provide longer documents."
    )

The message covers the case of empty or very short documents, but it is also raised for documents longer than 100k tokens, where it is misleading.
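
For context, here is a minimal sketch (assumed for illustration, not ragas's actual implementation) of how a bin lookup over these ranges falls through for oversized documents:

    # Assumed bin ranges and lookup logic, for illustration only.
    bin_ranges = [(0, 100), (101, 500), (501, 100_000)]

    def find_bin(token_count: int) -> int:
        for i, (low, high) in enumerate(bin_ranges):
            if low <= token_count <= high:
                return i
        # Falls through for token_count > 100_000, yet the error
        # message only mentions the short-document case.
        raise ValueError(
            "Documents appears to be too short (ie 100 tokens or less). "
            "Please provide longer documents."
        )

    find_bin(150_000)  # raises the misleading "too short" error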

Solution (Currently implemented)

I'm not sure about this solution, but my first approach was to change the last bin's upper bound to infinity. This solves the problem easily, but could be inefficient for very large documents.

    bin_ranges = [(0, 100), (101, 500), (501, float("inf"))]
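
With this change, the lookup sketched above returns the last bin for any document longer than 500 tokens instead of raising, though arbitrarily large documents are then all handled the same way as mid-sized ones.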

Better Solution Proposal (Let's discuss this)

If the given document is larger than 100k tokens, split it in half and restart the transformation, recursing until each piece fits into the initial bin sizes. A rough sketch follows.
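
Here is a sketch of this recursive halving with hypothetical helper names (apply_transform stands in for the actual transformation, num_tokens for ragas's token counter; neither is real ragas API):

    MAX_TOKENS = 100_000  # upper bound of the current last bin

    def transform_document(doc: str, num_tokens, apply_transform) -> list:
        # If the document fits the existing bins, transform it directly.
        if num_tokens(doc) <= MAX_TOKENS:
            return [apply_transform(doc)]
        # Otherwise split in half (naively, by characters) and recurse
        # until every piece fits into the initial bin sizes.
        mid = len(doc) // 2
        return (transform_document(doc[:mid], num_tokens, apply_transform)
                + transform_document(doc[mid:], num_tokens, apply_transform))

One drawback: a raw character-midpoint split can cut a sentence in two, which motivates the overlap-based approach in the follow-up comment below.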

@dosubot (bot) added the size:XS label (This PR changes 0-9 lines, ignoring generated files) on Mar 5, 2025
@tolgaerdonmez (Author)

I've found another solution:
Split the document into chunks of half the total token length using LangChain's text splitters with overlap.
Use the token counting function from ragas itself as the splitter's length function. A sketch is below.
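
A hedged sketch of that idea using LangChain's RecursiveCharacterTextSplitter; the tiktoken-based counter here is an assumed stand-in for ragas's own token counting function:

    import tiktoken
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    enc = tiktoken.get_encoding("cl100k_base")  # assumed tokenizer

    def num_tokens(text: str) -> int:
        # Stand-in for the token counting function used in ragas itself.
        return len(enc.encode(text))

    def split_in_half(doc: str) -> list[str]:
        total = num_tokens(doc)  # intended for documents with >100k tokens
        splitter = RecursiveCharacterTextSplitter(
            chunk_size=total // 2,       # half of the total token length
            chunk_overlap=total // 20,   # overlap so chunks keep shared context
            length_function=num_tokens,  # measure chunk size in tokens
        )
        return splitter.split_text(doc)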
