-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Upgrade Bleve to the latest #7040
Conversation
Deploy preview for dgraph-docs ready! Built with commit bdea6d6 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What changes are we bringing in here by upgrading bleve?
Reviewable status: 0 of 2 files reviewed, all discussions resolved (waiting on @ajeetdsouza, @danielmai, and @vvbalaji-dgraph)
some tokenizers were updated (language tokenizers)' Also, Bleve's dependencies have shrunk too. So we are depending on fewer odd case things |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to document any user-facing changes?
Reviewable status: 0 of 2 files reviewed, all discussions resolved (waiting on @ajeetdsouza, @danielmai, and @vvbalaji-dgraph)
nope. Chinese works better now I think |
but fundamentally the functions are the same. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It sounds like updating bleve generates different tokens, then? I'm ok bringing this into v20.11. Is it right that this changes the index tokens that require a re-index to create the proper tokens for language full-text indices? So, it should not be cherry-picked as patches to the current releases.
Reviewed 2 of 2 files at r1.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @ajeetdsouza and @vvbalaji-dgraph)
tbqh, I've checked out some chinese tokens... it looks exactly the same to me. But the algorithms look to be less probabilistic than before. (the issue with Chinese tokenization of course, is that Chinese doesn't use spaces to delimit words, so to guess "words" as tokens you need to guess, hence probabilistic). |
I think we should bring this to 20.11. The current tagged Bleve is 2 years old, and in 2 years there are A LOT of advancements to tokenization tech. |
Upgraded Bleve to the latest version.
https://discuss.dgraph.io/t/can-upgrade-bleve-version-to-the-latest/11673/4
This change is