total bin coverage for default_transform() in Knowledge Graph transformations #1950
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Problem
default_transform() uses token lengths up to 100k (0-100k interval) and seperates it into three bins.
But for longer documents with token length >100k and 0 this function raises the following:
Which covers the case of empty documents but also violates the constraint >100k.
Solution (Currently implemented)
I'm not sure with this solution but my first approach was to change the last bin interval to
inf
. This solves the problem easily but could be inefficient for very large documents.Better Solution Proposal (Let's discuss this)
If the given document is larger than >100k tokens. Seperate the document in half. And start the transformation again, until it fits into the initial bin sizes.