
total bin coverage for default_transform() in Knowledge Graph transformations #1950


Open · wants to merge 1 commit into main

Conversation

tolgaerdonmez

Problem

default_transform() bins documents by token length over the 0-100k interval, separated into three bins.
But for documents whose token length falls outside these bins, e.g. longer than 100k tokens, this function raises the following:

    raise ValueError(
        "Documents appears to be too short (ie 100 tokens or less). Please provide longer documents."
    )

The message covers the case of empty or very short documents, but it is also raised for documents longer than 100k tokens, where it is misleading.
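
For context, here is a minimal sketch (assumed for illustration, not ragas's actual implementation) of how a bin lookup over these ranges falls through for oversized documents:

    # Assumed bin ranges and lookup logic, for illustration only.
    bin_ranges = [(0, 100), (101, 500), (501, 100_000)]

    def find_bin(token_count: int) -> int:
        for i, (low, high) in enumerate(bin_ranges):
            if low <= token_count <= high:
                return i
        # Falls through for token_count > 100_000, yet the error
        # message only mentions the short-document case.
        raise ValueError(
            "Documents appears to be too short (ie 100 tokens or less). "
            "Please provide longer documents."
        )

    find_bin(150_000)  # raises the misleading "too short" error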

Solution (Currently implemented)

I'm not sure about this solution, but my first approach was to change the last bin's upper bound to infinity. This solves the problem easily, but could be inefficient for very large documents.

    bin_ranges = [(0, 100), (101, 500), (501, float("inf"))]
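
With this change, the lookup sketched above returns the last bin for any document longer than 500 tokens instead of raising, though arbitrarily large documents are then all handled the same way as mid-sized ones.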

Better Solution Proposal (Let's discuss this)

If the given document is larger than 100k tokens, split it in half and restart the transformation, recursing until each piece fits into the initial bin sizes. A rough sketch follows.
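
Here is a sketch of this recursive halving with hypothetical helper names (apply_transform stands in for the actual transformation, num_tokens for ragas's token counter; neither is real ragas API):

    MAX_TOKENS = 100_000  # upper bound of the current last bin

    def transform_document(doc: str, num_tokens, apply_transform) -> list:
        # If the document fits the existing bins, transform it directly.
        if num_tokens(doc) <= MAX_TOKENS:
            return [apply_transform(doc)]
        # Otherwise split in half (naively, by characters) and recurse
        # until every piece fits into the initial bin sizes.
        mid = len(doc) // 2
        return (transform_document(doc[:mid], num_tokens, apply_transform)
                + transform_document(doc[mid:], num_tokens, apply_transform))

One drawback: a raw character-midpoint split can cut a sentence in two, which motivates the overlap-based approach in the follow-up comment below.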

@dosubot (bot) added the size:XS label (This PR changes 0-9 lines, ignoring generated files) on Mar 5, 2025
@tolgaerdonmez (Author)

I've found another solution:
Split the document into chunks of half the total token length using LangChain's text splitters with overlap.
Use the token counting function from ragas itself as the splitter's length function. A sketch is below.
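
A hedged sketch of that idea using LangChain's RecursiveCharacterTextSplitter; the tiktoken-based counter here is an assumed stand-in for ragas's own token counting function:

    import tiktoken
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    enc = tiktoken.get_encoding("cl100k_base")  # assumed tokenizer

    def num_tokens(text: str) -> int:
        # Stand-in for the token counting function used in ragas itself.
        return len(enc.encode(text))

    def split_in_half(doc: str) -> list[str]:
        total = num_tokens(doc)  # intended for documents with >100k tokens
        splitter = RecursiveCharacterTextSplitter(
            chunk_size=total // 2,       # half of the total token length
            chunk_overlap=total // 20,   # overlap so chunks keep shared context
            length_function=num_tokens,  # measure chunk size in tokens
        )
        return splitter.split_text(doc)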
