rustdoc: compute maximum Levenshtein distance based on the query #103710

notriddle · 2022-10-29T05:01:54Z

Preview: https://notriddle.com/notriddle-rustdoc-demos/search-lev-distance/std/index.html?search=regex

The heuristic is pretty close to the name resolver, maxLevDistance = Math.ceil(queryLen / 3).

Fixes #103357
Fixes #82131

The heuristic is pretty close to the name resolver. Fixes rust-lang#103357

rustbot · 2022-10-29T05:02:00Z

r? @jsha

(rustbot has picked a reviewer for you, use r? to override)

rustbot · 2022-10-29T05:02:02Z

Some changes occurred in HTML/CSS/JS.

cc @GuillaumeGomez, @Folyd, @jsha

GuillaumeGomez · 2022-10-29T09:42:27Z

Thanks, looks like a nice improvement! Considering it's quite a big change for users, I'd prefer to have more people from rustdoc to take a look first.

cc @rust-lang/rustdoc

jsha · 2022-10-30T10:54:52Z

Looking at just the results, not the code: this is a big improvement in the results for short and no-results queries. Thanks!

There's still a problem where single-character searches return a lot of ::0 enum fields, which I thought would happen to be fixed by this. But that's not a blocker by any means; presumably we just need to find the fix elsewhere.

notriddle · 2022-10-30T14:21:51Z

That’s because Math.ceil(1/3)=1, and the edit distance between 0 and X is 1. The problem with tweaking the heuristic to make the number come out 0 is that the results aren’t really better: I would like X to match XY, but that one also has an edit distance of 1.

I’d almost rather just exclude tuple indices from the search index. Is anyone really going to type Some::0 into the search engine?

Manishearth · 2022-10-30T14:29:12Z

Yeah we should exclude them

GuillaumeGomez · 2022-10-30T15:16:40Z

That’s because Math.ceil(1/3)=1, and the edit distance between 0 and X is 1. The problem with tweaking the heuristic to make the number come out 0 is that the results aren’t really better: I would like X to match XY, but that one also has an edit distance of 1.

Maybe we should use floor instead? Also, it should still have xy if you search for x since xy starts with x.

I’d almost rather just exclude tuple indices from the search index. Is anyone really going to type Some::0 into the search engine?

+1

jsha · 2022-10-30T22:02:34Z

Oh, I thought it was something specific about the 0; but it's just the combination of:

maxLevenshtein for query of length 1 is 1, plus
0 sorts ahead of everything else, plus
we cut off results past 100

So, yes, I agree we should just not index these; but also we should use floor. Otherwise, once all the 0s are gone, any other single-letter idents will match any single-letter queries.

GuillaumeGomez · 2022-10-30T22:04:02Z

Seems like we all agree on removing the tuple elements from the index. I'll send a PR tomorrow.

notriddle · 2022-10-30T22:40:19Z

Also, it should still have xy if you search for x since xy starts with x.

The problem with searching with maxLevDistance = 0 is that substring matches are not necessary treated as lev = 0. They just act as a modifier for levenshtein distance.

rust/src/librustdoc/html/static/js/search.js

Lines 1328 to 1349 in cc8155f

    
           if (searchWord.indexOf(elem.pathLast) > -1 || 
        
               row.normalizedName.indexOf(elem.pathLast) > -1 
        
           ) { 
        
               index = row.normalizedName.indexOf(elem.pathLast); 
        
           } 
        
           lev = levenshtein(searchWord, elem.pathLast); 
        
           if (lev > 0 && elem.pathLast.length > 2 && searchWord.indexOf(elem.pathLast) > -1) { 
        
               if (elem.pathLast.length < 6) { 
        
                   lev = 1; 
        
               } else { 
        
                   lev = 0; 
        
               } 
        
           } 
        
           lev += lev_add; 
        
           if (lev > maxLevDistance) { 
        
               return; 
        
           } else if (index !== -1 && elem.fullPath.length < 2) { 
        
               lev -= 1; 
        
           } 
        
           if (lev < 0) { 
        
               lev = 0; 
        
           }

GuillaumeGomez · 2022-10-31T10:25:33Z

Maybe we should change this behaviour? Something like:

If the item name starts with X, then give it a lev distance of 2.
If the item name contains X, then give it a lev distance of 3.

I'm curious of what the results would look like.

jsha · 2022-10-31T10:32:20Z

I think it's a mistake to use levenshtein distance also the ranking metric and apply boosts and demerits to it. The existing code already does this to a large extent and it makes things very confusing and hard to maintain. It would be better to more explicitly separate the matching and ranking functionality.

GuillaumeGomez · 2022-10-31T10:36:23Z

That would make things simpler. I'm also curious to see what the results will look like.

…struct-field, r=notriddle Remove generation of tuple struct fields in the search index This comes from [this discussion](rust-lang#103710) as they're not very useful. r? `@notriddle`

bors · 2022-11-01T09:32:51Z

☔ The latest upstream changes (presumably #103829) made this pull request unmergeable. Please resolve the merge conflicts.

Since the sorting function accounts for an `index` field, there's not much reason to also be applying changes to the levenshtein distance. Instead, we can just not treat `lev` as a filter if there's already a non-sentinel value for `index`. This change gives slightly more weight to the index and path part, as search criteria, than it used to. This changes some of the test cases, but not in any obviously-"worse" way, and, in particular, substring matches are a bigger deal than levenshtein distances (we're assuming that a typo is less likely than someone just not typing the entire name). Based on rust-lang#103710 (comment)

…-stop-doing-demerits, r=GuillaumeGomez rustdoc: simplify JS search routine by not messing with lev distance Since the sorting function accounts for an `index` field, there's not much reason to also be applying changes to the levenshtein distance. Instead, we can just not treat `lev` as a filter if there's already a non-sentinel value for `index`. <details> This change gives slightly more weight to the index and path part, as search criteria, than it used to. This changes some of the test cases, but not in any obviously-"worse" way, and, in particular, substring matches are a bigger deal than levenshtein distances (we're assuming that a typo is less likely than someone just not typing the entire name). The biggest change is the addition of a `path_lev` field to result items. It's always zero if the search query has no parent path part and for type queries, making the check in the `sortResults` function a no-op. When it's present, it is used to implement different precedence for the parent path and the tail. Consider the query `hashset::insert`, a test case [that already exists and can be found here](https://github.com/rust-lang/rust/blob/5c6a1681a9a7b815febdd9de2f840da338984e68/src/test/rustdoc-js-std/path-ordering.js). We want the ordering shown in the test case: ``` { 'path': 'std::collections::hash_set::HashSet', 'name': 'insert' }, { 'path': 'std::collections::hash_set::HashSet', 'name': 'get_or_insert' }, { 'path': 'std::collections::hash_set::HashSet', 'name': 'get_or_insert_with' }, { 'path': 'std::collections::hash_set::HashSet', 'name': 'get_or_insert_owned' }, { 'path': 'std::collections::hash_map::HashMap', 'name': 'insert' }, ``` We do not want this ordering, which is the ordering that would occur if substring position took priority over `path_lev`: ``` { 'path': 'std::collections::hash_set::HashSet', 'name': 'insert' }, { 'path': 'std::collections::hash_map::HashMap', 'name': 'insert' }, // BAD { 'path': 'std::collections::hash_set::HashSet', 'name': 'get_or_insert' }, { 'path': 'std::collections::hash_set::HashSet', 'name': 'get_or_insert_with' }, { 'path': 'std::collections::hash_set::HashSet', 'name': 'get_or_insert_owned' }, ``` We also do not want `HashSet::iter` to appear before `HashMap::insert`, which is what would happen if `path_lev` took priority over the appearance of any substring match. This is why the `sortResults` function has `path_lev` sandwiched between a `index < 0` check and a `index` comparison check: ``` { 'path': 'std::collections::hash_set::HashSet', 'name': 'insert' }, { 'path': 'std::collections::hash_set::HashSet', 'name': 'get_or_insert' }, { 'path': 'std::collections::hash_set::HashSet', 'name': 'get_or_insert_with' }, { 'path': 'std::collections::hash_set::HashSet', 'name': 'get_or_insert_owned' }, { 'path': 'std::collections::hash_set::HashSet', 'name': 'iter' }, // BAD { 'path': 'std::collections::hash_map::HashMap', 'name': 'insert' }, ``` The old code implemented a similar feature by manipulating the `lev` member based on whether a substring match was found and averaging in the path distance (`item.lev = name_lev + path_lev / 10`), so the path lev wound up acting like a tie breaker, but it gives slightly different results for `Vec::new`, [changing the test case](https://github.com/rust-lang/rust/pull/105796/files#diff-b346e2ef72a407915f438063c8c2c04f7a621df98923d441b41c0312211a5b21) because of the slight changes to ordering priority. </details> Based on rust-lang#103710 (comment) Previews: * https://notriddle.com/notriddle-rustdoc-demos/rustdoc-search-stop-doing-demerits/std/index.html * https://notriddle.com/notriddle-rustdoc-demos/rustdoc-search-stop-doing-demerits-compiler/index.html

…-2023, r=GuillaumeGomez rustdoc: compute maximum Levenshtein distance based on the query Preview: https://notriddle.com/notriddle-rustdoc-demos/search-lev-distance-2023/std/index.html?search=regex The heuristic is pretty close to the name resolver, maxLevDistance = `Math.floor(queryLen / 3)`. Fixes rust-lang#103357 Fixes rust-lang#82131 Similar to rust-lang#103710, but following the suggestion in rust-lang#103710 (comment) to use `floor` instead of `ceil`, and unblocked now that rust-lang#105796 made it so that setting the max lev distance to `0` doesn't cause substring matches to be removed.

…GuillaumeGomez rustdoc: compute maximum Levenshtein distance based on the query Preview: https://notriddle.com/notriddle-rustdoc-demos/search-lev-distance-2023/std/index.html?search=regex The heuristic is pretty close to the name resolver, maxLevDistance = `Math.floor(queryLen / 3)`. Fixes #103357 Fixes #82131 Similar to rust-lang/rust#103710, but following the suggestion in rust-lang/rust#103710 (comment) to use `floor` instead of `ceil`, and unblocked now that rust-lang/rust#105796 made it so that setting the max lev distance to `0` doesn't cause substring matches to be removed.

rustdoc: compute maximum Levenshtein distance based on the query

baa0f64

The heuristic is pretty close to the name resolver. Fixes rust-lang#103357

rustbot assigned jsha Oct 29, 2022

rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-rustdoc Relevant to the rustdoc team, which will review and decide on the PR/issue. labels Oct 29, 2022

This comment has been minimized.

Sign in to view

rustdoc: update test case to deal with "coo" only accepting dist=1

cc8155f

GuillaumeGomez mentioned this pull request Oct 31, 2022

Remove generation of tuple struct fields in the search index #103799

Merged

notriddle closed this Dec 16, 2022

notriddle mentioned this pull request Dec 16, 2022

rustdoc: simplify JS search routine by not messing with lev distance #105796

Merged

notriddle mentioned this pull request Jan 21, 2023

rustdoc: compute maximum Levenshtein distance based on the query #107141

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

rustdoc: compute maximum Levenshtein distance based on the query #103710

rustdoc: compute maximum Levenshtein distance based on the query #103710

Uh oh!

notriddle commented Oct 29, 2022 •

edited

Loading

Uh oh!

rustbot commented Oct 29, 2022

Uh oh!

rustbot commented Oct 29, 2022

Uh oh!

This comment has been minimized.

GuillaumeGomez commented Oct 29, 2022

Uh oh!

jsha commented Oct 30, 2022

Uh oh!

notriddle commented Oct 30, 2022

Uh oh!

Manishearth commented Oct 30, 2022

Uh oh!

GuillaumeGomez commented Oct 30, 2022

Uh oh!

jsha commented Oct 30, 2022

Uh oh!

GuillaumeGomez commented Oct 30, 2022

Uh oh!

notriddle commented Oct 30, 2022

Uh oh!

GuillaumeGomez commented Oct 31, 2022

Uh oh!

jsha commented Oct 31, 2022

Uh oh!

GuillaumeGomez commented Oct 31, 2022

Uh oh!

bors commented Nov 1, 2022

Uh oh!

Uh oh!

rustdoc: compute maximum Levenshtein distance based on the query #103710

rustdoc: compute maximum Levenshtein distance based on the query #103710

Uh oh!

Conversation

notriddle commented Oct 29, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rustbot commented Oct 29, 2022

Uh oh!

rustbot commented Oct 29, 2022

Uh oh!

This comment has been minimized.

GuillaumeGomez commented Oct 29, 2022

Uh oh!

jsha commented Oct 30, 2022

Uh oh!

notriddle commented Oct 30, 2022

Uh oh!

Manishearth commented Oct 30, 2022

Uh oh!

GuillaumeGomez commented Oct 30, 2022

Uh oh!

jsha commented Oct 30, 2022

Uh oh!

GuillaumeGomez commented Oct 30, 2022

Uh oh!

notriddle commented Oct 30, 2022

Uh oh!

GuillaumeGomez commented Oct 31, 2022

Uh oh!

jsha commented Oct 31, 2022

Uh oh!

GuillaumeGomez commented Oct 31, 2022

Uh oh!

bors commented Nov 1, 2022

Uh oh!

Uh oh!

notriddle commented Oct 29, 2022 •

edited

Loading