-
Notifications
You must be signed in to change notification settings - Fork 12.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rustdoc: compute maximum Levenshtein distance based on the query #103710
Conversation
The heuristic is pretty close to the name resolver. Fixes rust-lang#103357
r? @jsha (rustbot has picked a reviewer for you, use r? to override) |
Some changes occurred in HTML/CSS/JS. cc @GuillaumeGomez, @Folyd, @jsha |
This comment has been minimized.
This comment has been minimized.
Thanks, looks like a nice improvement! Considering it's quite a big change for users, I'd prefer to have more people from rustdoc to take a look first. cc @rust-lang/rustdoc |
Looking at just the results, not the code: this is a big improvement in the results for short and no-results queries. Thanks! There's still a problem where single-character searches return a lot of |
That’s because Math.ceil(1/3)=1, and the edit distance between I’d almost rather just exclude tuple indices from the search index. Is anyone really going to type |
Yeah we should exclude them |
Maybe we should use
+1 |
Oh, I thought it was something specific about the
So, yes, I agree we should just not index these; but also we should use floor. Otherwise, once all the |
Seems like we all agree on removing the tuple elements from the index. I'll send a PR tomorrow. |
The problem with searching with rust/src/librustdoc/html/static/js/search.js Lines 1328 to 1349 in cc8155f
|
Maybe we should change this behaviour? Something like:
I'm curious of what the results would look like. |
I think it's a mistake to use levenshtein distance also the ranking metric and apply boosts and demerits to it. The existing code already does this to a large extent and it makes things very confusing and hard to maintain. It would be better to more explicitly separate the matching and ranking functionality. |
That would make things simpler. I'm also curious to see what the results will look like. |
…struct-field, r=notriddle Remove generation of tuple struct fields in the search index This comes from [this discussion](rust-lang#103710) as they're not very useful. r? `@notriddle`
☔ The latest upstream changes (presumably #103829) made this pull request unmergeable. Please resolve the merge conflicts. |
Since the sorting function accounts for an `index` field, there's not much reason to also be applying changes to the levenshtein distance. Instead, we can just not treat `lev` as a filter if there's already a non-sentinel value for `index`. This change gives slightly more weight to the index and path part, as search criteria, than it used to. This changes some of the test cases, but not in any obviously-"worse" way, and, in particular, substring matches are a bigger deal than levenshtein distances (we're assuming that a typo is less likely than someone just not typing the entire name). Based on rust-lang#103710 (comment)
…-stop-doing-demerits, r=GuillaumeGomez rustdoc: simplify JS search routine by not messing with lev distance Since the sorting function accounts for an `index` field, there's not much reason to also be applying changes to the levenshtein distance. Instead, we can just not treat `lev` as a filter if there's already a non-sentinel value for `index`. <details> This change gives slightly more weight to the index and path part, as search criteria, than it used to. This changes some of the test cases, but not in any obviously-"worse" way, and, in particular, substring matches are a bigger deal than levenshtein distances (we're assuming that a typo is less likely than someone just not typing the entire name). The biggest change is the addition of a `path_lev` field to result items. It's always zero if the search query has no parent path part and for type queries, making the check in the `sortResults` function a no-op. When it's present, it is used to implement different precedence for the parent path and the tail. Consider the query `hashset::insert`, a test case [that already exists and can be found here](https://github.com/rust-lang/rust/blob/5c6a1681a9a7b815febdd9de2f840da338984e68/src/test/rustdoc-js-std/path-ordering.js). We want the ordering shown in the test case: ``` { 'path': 'std::collections::hash_set::HashSet', 'name': 'insert' }, { 'path': 'std::collections::hash_set::HashSet', 'name': 'get_or_insert' }, { 'path': 'std::collections::hash_set::HashSet', 'name': 'get_or_insert_with' }, { 'path': 'std::collections::hash_set::HashSet', 'name': 'get_or_insert_owned' }, { 'path': 'std::collections::hash_map::HashMap', 'name': 'insert' }, ``` We do not want this ordering, which is the ordering that would occur if substring position took priority over `path_lev`: ``` { 'path': 'std::collections::hash_set::HashSet', 'name': 'insert' }, { 'path': 'std::collections::hash_map::HashMap', 'name': 'insert' }, // BAD { 'path': 'std::collections::hash_set::HashSet', 'name': 'get_or_insert' }, { 'path': 'std::collections::hash_set::HashSet', 'name': 'get_or_insert_with' }, { 'path': 'std::collections::hash_set::HashSet', 'name': 'get_or_insert_owned' }, ``` We also do not want `HashSet::iter` to appear before `HashMap::insert`, which is what would happen if `path_lev` took priority over the appearance of any substring match. This is why the `sortResults` function has `path_lev` sandwiched between a `index < 0` check and a `index` comparison check: ``` { 'path': 'std::collections::hash_set::HashSet', 'name': 'insert' }, { 'path': 'std::collections::hash_set::HashSet', 'name': 'get_or_insert' }, { 'path': 'std::collections::hash_set::HashSet', 'name': 'get_or_insert_with' }, { 'path': 'std::collections::hash_set::HashSet', 'name': 'get_or_insert_owned' }, { 'path': 'std::collections::hash_set::HashSet', 'name': 'iter' }, // BAD { 'path': 'std::collections::hash_map::HashMap', 'name': 'insert' }, ``` The old code implemented a similar feature by manipulating the `lev` member based on whether a substring match was found and averaging in the path distance (`item.lev = name_lev + path_lev / 10`), so the path lev wound up acting like a tie breaker, but it gives slightly different results for `Vec::new`, [changing the test case](https://github.com/rust-lang/rust/pull/105796/files#diff-b346e2ef72a407915f438063c8c2c04f7a621df98923d441b41c0312211a5b21) because of the slight changes to ordering priority. </details> Based on rust-lang#103710 (comment) Previews: * https://notriddle.com/notriddle-rustdoc-demos/rustdoc-search-stop-doing-demerits/std/index.html * https://notriddle.com/notriddle-rustdoc-demos/rustdoc-search-stop-doing-demerits-compiler/index.html
…-stop-doing-demerits, r=GuillaumeGomez rustdoc: simplify JS search routine by not messing with lev distance Since the sorting function accounts for an `index` field, there's not much reason to also be applying changes to the levenshtein distance. Instead, we can just not treat `lev` as a filter if there's already a non-sentinel value for `index`. <details> This change gives slightly more weight to the index and path part, as search criteria, than it used to. This changes some of the test cases, but not in any obviously-"worse" way, and, in particular, substring matches are a bigger deal than levenshtein distances (we're assuming that a typo is less likely than someone just not typing the entire name). The biggest change is the addition of a `path_lev` field to result items. It's always zero if the search query has no parent path part and for type queries, making the check in the `sortResults` function a no-op. When it's present, it is used to implement different precedence for the parent path and the tail. Consider the query `hashset::insert`, a test case [that already exists and can be found here](https://github.com/rust-lang/rust/blob/5c6a1681a9a7b815febdd9de2f840da338984e68/src/test/rustdoc-js-std/path-ordering.js). We want the ordering shown in the test case: ``` { 'path': 'std::collections::hash_set::HashSet', 'name': 'insert' }, { 'path': 'std::collections::hash_set::HashSet', 'name': 'get_or_insert' }, { 'path': 'std::collections::hash_set::HashSet', 'name': 'get_or_insert_with' }, { 'path': 'std::collections::hash_set::HashSet', 'name': 'get_or_insert_owned' }, { 'path': 'std::collections::hash_map::HashMap', 'name': 'insert' }, ``` We do not want this ordering, which is the ordering that would occur if substring position took priority over `path_lev`: ``` { 'path': 'std::collections::hash_set::HashSet', 'name': 'insert' }, { 'path': 'std::collections::hash_map::HashMap', 'name': 'insert' }, // BAD { 'path': 'std::collections::hash_set::HashSet', 'name': 'get_or_insert' }, { 'path': 'std::collections::hash_set::HashSet', 'name': 'get_or_insert_with' }, { 'path': 'std::collections::hash_set::HashSet', 'name': 'get_or_insert_owned' }, ``` We also do not want `HashSet::iter` to appear before `HashMap::insert`, which is what would happen if `path_lev` took priority over the appearance of any substring match. This is why the `sortResults` function has `path_lev` sandwiched between a `index < 0` check and a `index` comparison check: ``` { 'path': 'std::collections::hash_set::HashSet', 'name': 'insert' }, { 'path': 'std::collections::hash_set::HashSet', 'name': 'get_or_insert' }, { 'path': 'std::collections::hash_set::HashSet', 'name': 'get_or_insert_with' }, { 'path': 'std::collections::hash_set::HashSet', 'name': 'get_or_insert_owned' }, { 'path': 'std::collections::hash_set::HashSet', 'name': 'iter' }, // BAD { 'path': 'std::collections::hash_map::HashMap', 'name': 'insert' }, ``` The old code implemented a similar feature by manipulating the `lev` member based on whether a substring match was found and averaging in the path distance (`item.lev = name_lev + path_lev / 10`), so the path lev wound up acting like a tie breaker, but it gives slightly different results for `Vec::new`, [changing the test case](https://github.com/rust-lang/rust/pull/105796/files#diff-b346e2ef72a407915f438063c8c2c04f7a621df98923d441b41c0312211a5b21) because of the slight changes to ordering priority. </details> Based on rust-lang#103710 (comment) Previews: * https://notriddle.com/notriddle-rustdoc-demos/rustdoc-search-stop-doing-demerits/std/index.html * https://notriddle.com/notriddle-rustdoc-demos/rustdoc-search-stop-doing-demerits-compiler/index.html
…-2023, r=GuillaumeGomez rustdoc: compute maximum Levenshtein distance based on the query Preview: https://notriddle.com/notriddle-rustdoc-demos/search-lev-distance-2023/std/index.html?search=regex The heuristic is pretty close to the name resolver, maxLevDistance = `Math.floor(queryLen / 3)`. Fixes rust-lang#103357 Fixes rust-lang#82131 Similar to rust-lang#103710, but following the suggestion in rust-lang#103710 (comment) to use `floor` instead of `ceil`, and unblocked now that rust-lang#105796 made it so that setting the max lev distance to `0` doesn't cause substring matches to be removed.
…GuillaumeGomez rustdoc: compute maximum Levenshtein distance based on the query Preview: https://notriddle.com/notriddle-rustdoc-demos/search-lev-distance-2023/std/index.html?search=regex The heuristic is pretty close to the name resolver, maxLevDistance = `Math.floor(queryLen / 3)`. Fixes #103357 Fixes #82131 Similar to rust-lang/rust#103710, but following the suggestion in rust-lang/rust#103710 (comment) to use `floor` instead of `ceil`, and unblocked now that rust-lang/rust#105796 made it so that setting the max lev distance to `0` doesn't cause substring matches to be removed.
Preview: https://notriddle.com/notriddle-rustdoc-demos/search-lev-distance/std/index.html?search=regex
The heuristic is pretty close to the name resolver,
maxLevDistance = Math.ceil(queryLen / 3)
.Fixes #103357
Fixes #82131