-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve hash index #2887
Improve hash index #2887
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 5 files reviewed, 5 unresolved discussions (waiting on @srfrog and @manishrjain)
tok/tok.go, line 43 at r1 (raw file):
IdentInt = 0x6 IdentFloat = 0x7 IdentFullText = 0x8
If you're defining them here, then return these in Identifier()
methods.
tok/tok.go, line 375 at r1 (raw file):
func (t HashTokenizer) Identifier() byte { return 0xB } func (t HashTokenizer) IsSortable() bool { return false } func (t HashTokenizer) IsLossy() bool { return true }
Just change that to false, with a comment saying we decide not to check for collisions, and the chance of collisions is very low with the link to the Github issue. It would then automatically not retrieve values.
worker/task.go, line 990 at r1 (raw file):
var keyFn func(int, uint64) []byte if tokenizer.Identifier() == tok.IdentHash {
No need for this and the following changes. See my comment about IsLossy
.
worker/tokens.go, line 90 at r1 (raw file):
tokenizers := schema.State().Tokenizer(attr) tokIdx := -1
Same about lossy. No need for these changes.
x/hash.go, line 21 at r1 (raw file):
import "golang.org/x/crypto/blake2b" func Hash256(data []byte) []byte {
No need for this indirection. blake2b.Sum256 can be called right where it's used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the change is even easier than I thought. If we set the lossy bit to false, hash index would automatically not retrieve values. Just needs thorough testing to ensure there're no side effects.
Reviewable status: 0 of 5 files reviewed, 5 unresolved discussions (waiting on @srfrog and @manishrjain)
fixed bug in getInequalityTokens with fulltext indexes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 5 files reviewed, 5 unresolved discussions (waiting on @manishrjain)
tok/tok.go, line 43 at r1 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
If you're defining them here, then return these in
Identifier()
methods.
Done.
tok/tok.go, line 375 at r1 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
Just change that to false, with a comment saying we decide not to check for collisions, and the chance of collisions is very low with the link to the Github issue. It would then automatically not retrieve values.
Done.
worker/task.go, line 990 at r1 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
No need for this and the following changes. See my comment about
IsLossy
.
Done.
worker/tokens.go, line 90 at r1 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
Same about lossy. No need for these changes.
Done.
x/hash.go, line 21 at r1 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
No need for this indirection. blake2b.Sum256 can be called right where it's used.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 5 files reviewed, 6 unresolved discussions (waiting on @manishrjain and @srfrog)
worker/tokens.go, line 38 at r2 (raw file):
} if !schema.State().IsIndexed(attr) {
minor: do you think it's better to move this check to the beginning of this function as the first step? That way, setting the requiredTokenizer variable and using it can be brought closer as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 5 files reviewed, 6 unresolved discussions (waiting on @manishrjain and @gitlw)
worker/tokens.go, line 38 at r2 (raw file):
Previously, gitlw (Lucas Wang) wrote…
minor: do you think it's better to move this check to the beginning of this function as the first step? That way, setting the requiredTokenizer variable and using it can be brought closer as well.
no, unfortunately we need to know the type of tokenizer to respond with the name.
…ue-2776_improve_hash_index
…stom comparison. Simplify pickTokenizer. Added comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've made a few changes. Have a look and then feel free to merge.
Reviewable status: 0 of 6 files reviewed, 6 unresolved discussions (waiting on @manishrjain and @srfrog)
tok/tok.go, line 375 at r1 (raw file):
Previously, srfrog (Gus) wrote…
Done.
I added the comment.
tok/tok.go, line 127 at r3 (raw file):
id := tokenizer.Identifier() x.AssertTruef(id < IdentCustom,
I think you accidentally flipped the bool around. I fixed it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 6 files reviewed, 6 unresolved discussions (waiting on @manishrjain and @srfrog)
tok/tok.go, line 127 at r3 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
I think you accidentally flipped the bool around. I fixed it.
oops thx
…y (eq) or sortable (le, ge, gt, lt) index
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you reintroduced some bugs in your later commits. I retract my LGTM. Needs more review.
Reviewable status: 0 of 6 files reviewed, 6 unresolved discussions (waiting on @manishrjain)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove the glog as I mentioned in the comment below. And then feel free to merge it .
Reviewable status: 0 of 6 files reviewed, 4 unresolved discussions (waiting on @manishrjain and @srfrog)
worker/tokens.go, line 103 at r5 (raw file):
return nil, x.Errorf("Attribute:%s does not have proper index for comparison", attr) } glog.Infof("Attribute:%s couldn't find a non-lossy tokenizer for 'eq' comparison", attr)
Remove the glog.
…ue-2776_improve_hash_index
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 6 files reviewed, 4 unresolved discussions (waiting on @manishrjain)
worker/tokens.go, line 103 at r5 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
Remove the glog.
Done.
* saving state * added new fingerprint func using BLAKE2b * renamed function to Hash256 for clarity. * replaced 64 fingerprint hash with Hash256 * pickTokenizer use hash tokenizer when list is lossy. * added tokenizer identifier list for enforcing tokenizer. * compare func using hash index if available and eq won't compare values * fixed minor comment glitches * use tokenizer identifier consts, change hash to non-lossy. * using non-lossy hash so no need for extra logic in handleCompareFunction * simplify pickTokenizer and * simplify pickTokenizer * using tokenizer id * added id value for custom tokenizers, IdentCustom * using tokenizer ids when possible fixed bug in getInequalityTokens with fulltext indexes. * added hash index tests * Manish's review. Fixed a new bug introduced by this PR during IdentCustom comparison. Simplify pickTokenizer. Added comments. * Remove Long term for exact index warning. * fixed logic * pickTokenizer return error when comparison func doesn't have non-lossy (eq) or sortable (le, ge, gt, lt) index * added warning for eq comparison without non-lossy tokenizer * re-fixed this slippery lil bug * removed extra glog
This PR improves "eq" comparison by not having to load values for compare. We use BLAKE2b 256bit hash to tokenize the hash index values.
Closes #2776
This change is