Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve hash index #2887

Merged
merged 25 commits into from
Jan 16, 2019
Merged

Improve hash index #2887

merged 25 commits into from
Jan 16, 2019

Conversation

srfrog
Copy link
Contributor

@srfrog srfrog commented Jan 10, 2019

This PR improves "eq" comparison by not having to load values for compare. We use BLAKE2b 256bit hash to tokenize the hash index values.

Closes #2776


This change is Reviewable

Copy link
Contributor

@manishrjain manishrjain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 5 files reviewed, 5 unresolved discussions (waiting on @srfrog and @manishrjain)


tok/tok.go, line 43 at r1 (raw file):

	IdentInt      = 0x6
	IdentFloat    = 0x7
	IdentFullText = 0x8

If you're defining them here, then return these in Identifier() methods.


tok/tok.go, line 375 at r1 (raw file):

func (t HashTokenizer) Identifier() byte { return 0xB }
func (t HashTokenizer) IsSortable() bool { return false }
func (t HashTokenizer) IsLossy() bool    { return true }

Just change that to false, with a comment saying we decide not to check for collisions, and the chance of collisions is very low with the link to the Github issue. It would then automatically not retrieve values.


worker/task.go, line 990 at r1 (raw file):

		var keyFn func(int, uint64) []byte
		if tokenizer.Identifier() == tok.IdentHash {

No need for this and the following changes. See my comment about IsLossy.


worker/tokens.go, line 90 at r1 (raw file):

	tokenizers := schema.State().Tokenizer(attr)

	tokIdx := -1

Same about lossy. No need for these changes.


x/hash.go, line 21 at r1 (raw file):

import "golang.org/x/crypto/blake2b"

func Hash256(data []byte) []byte {

No need for this indirection. blake2b.Sum256 can be called right where it's used.

Copy link
Contributor

@manishrjain manishrjain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the change is even easier than I thought. If we set the lossy bit to false, hash index would automatically not retrieve values. Just needs thorough testing to ensure there're no side effects.

Reviewable status: 0 of 5 files reviewed, 5 unresolved discussions (waiting on @srfrog and @manishrjain)

Copy link
Contributor Author

@srfrog srfrog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 5 files reviewed, 5 unresolved discussions (waiting on @manishrjain)


tok/tok.go, line 43 at r1 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

If you're defining them here, then return these in Identifier() methods.

Done.


tok/tok.go, line 375 at r1 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

Just change that to false, with a comment saying we decide not to check for collisions, and the chance of collisions is very low with the link to the Github issue. It would then automatically not retrieve values.

Done.


worker/task.go, line 990 at r1 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

No need for this and the following changes. See my comment about IsLossy.

Done.


worker/tokens.go, line 90 at r1 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

Same about lossy. No need for these changes.

Done.


x/hash.go, line 21 at r1 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

No need for this indirection. blake2b.Sum256 can be called right where it's used.

Done.

Copy link

@gitlw gitlw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 5 files reviewed, 6 unresolved discussions (waiting on @manishrjain and @srfrog)


worker/tokens.go, line 38 at r2 (raw file):

	}

	if !schema.State().IsIndexed(attr) {

minor: do you think it's better to move this check to the beginning of this function as the first step? That way, setting the requiredTokenizer variable and using it can be brought closer as well.

Copy link
Contributor Author

@srfrog srfrog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 5 files reviewed, 6 unresolved discussions (waiting on @manishrjain and @gitlw)


worker/tokens.go, line 38 at r2 (raw file):

Previously, gitlw (Lucas Wang) wrote…

minor: do you think it's better to move this check to the beginning of this function as the first step? That way, setting the requiredTokenizer variable and using it can be brought closer as well.

no, unfortunately we need to know the type of tokenizer to respond with the name.

@srfrog srfrog changed the title WIP: Improve hash index Improve hash index Jan 11, 2019
@srfrog srfrog dismissed manishrjain’s stale review January 11, 2019 05:33

All changes completed

Copy link
Contributor

@manishrjain manishrjain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: I've made a few changes. Have a look and then feel free to merge.

Reviewable status: 0 of 6 files reviewed, 6 unresolved discussions (waiting on @manishrjain and @srfrog)


tok/tok.go, line 375 at r1 (raw file):

Previously, srfrog (Gus) wrote…

Done.

I added the comment.


tok/tok.go, line 127 at r3 (raw file):

	id := tokenizer.Identifier()
	x.AssertTruef(id < IdentCustom,

I think you accidentally flipped the bool around. I fixed it.

Copy link
Contributor Author

@srfrog srfrog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 6 files reviewed, 6 unresolved discussions (waiting on @manishrjain and @srfrog)


tok/tok.go, line 127 at r3 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

I think you accidentally flipped the bool around. I fixed it.

oops thx

Copy link
Contributor

@manishrjain manishrjain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you reintroduced some bugs in your later commits. I retract my LGTM. Needs more review.

:lgtm_cancel:

Reviewable status: 0 of 6 files reviewed, 6 unresolved discussions (waiting on @manishrjain)

Copy link
Contributor

@manishrjain manishrjain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: Remove the glog as I mentioned in the comment below. And then feel free to merge it .

Reviewable status: 0 of 6 files reviewed, 4 unresolved discussions (waiting on @manishrjain and @srfrog)


worker/tokens.go, line 103 at r5 (raw file):

		return nil, x.Errorf("Attribute:%s does not have proper index for comparison", attr)
	}
	glog.Infof("Attribute:%s couldn't find a non-lossy tokenizer for 'eq' comparison", attr)

Remove the glog.

Copy link
Contributor Author

@srfrog srfrog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 6 files reviewed, 4 unresolved discussions (waiting on @manishrjain)


worker/tokens.go, line 103 at r5 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

Remove the glog.

Done.

@srfrog srfrog merged commit 53eb672 into master Jan 16, 2019
@srfrog srfrog deleted the srfrog/issue-2776_improve_hash_index branch January 16, 2019 03:28
dna2github pushed a commit to dna2fork/dgraph that referenced this pull request Jul 19, 2019
* saving state

* added new fingerprint func using BLAKE2b

* renamed function to Hash256 for clarity.

* replaced 64 fingerprint hash with Hash256

* pickTokenizer use hash tokenizer when list is lossy.

* added tokenizer identifier list for enforcing tokenizer.

* compare func using hash index if available and eq won't compare values

* fixed minor comment glitches

* use tokenizer identifier consts, change hash to non-lossy.

* using non-lossy hash so no need for extra logic in handleCompareFunction

* simplify pickTokenizer and

* simplify pickTokenizer

* using tokenizer id

* added id value for custom tokenizers, IdentCustom

* using tokenizer ids when possible
fixed bug in getInequalityTokens with fulltext indexes.

* added hash index tests

* Manish's review. Fixed a new bug introduced by this PR during IdentCustom comparison. Simplify pickTokenizer. Added comments.

* Remove Long term for exact index warning.

* fixed logic

* pickTokenizer return error when comparison func doesn't have non-lossy (eq) or sortable (le, ge, gt, lt) index

* added warning for eq comparison without non-lossy tokenizer

* re-fixed this slippery lil bug

* removed extra glog
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants