Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rustdoc: compute maximum Levenshtein distance based on the query #103710

Closed

Conversation

notriddle
Copy link
Contributor

@notriddle notriddle commented Oct 29, 2022

Preview: https://notriddle.com/notriddle-rustdoc-demos/search-lev-distance/std/index.html?search=regex

The heuristic is pretty close to the name resolver, maxLevDistance = Math.ceil(queryLen / 3).

Fixes #103357
Fixes #82131

The heuristic is pretty close to the name resolver.

Fixes rust-lang#103357
@rustbot
Copy link
Collaborator

rustbot commented Oct 29, 2022

r? @jsha

(rustbot has picked a reviewer for you, use r? to override)

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-rustdoc Relevant to the rustdoc team, which will review and decide on the PR/issue. labels Oct 29, 2022
@rustbot
Copy link
Collaborator

rustbot commented Oct 29, 2022

Some changes occurred in HTML/CSS/JS.

cc @GuillaumeGomez, @Folyd, @jsha

@rust-log-analyzer

This comment has been minimized.

@GuillaumeGomez
Copy link
Member

Thanks, looks like a nice improvement! Considering it's quite a big change for users, I'd prefer to have more people from rustdoc to take a look first.

cc @rust-lang/rustdoc

@jsha
Copy link
Contributor

jsha commented Oct 30, 2022

Looking at just the results, not the code: this is a big improvement in the results for short and no-results queries. Thanks!

There's still a problem where single-character searches return a lot of ::0 enum fields, which I thought would happen to be fixed by this. But that's not a blocker by any means; presumably we just need to find the fix elsewhere.

@notriddle
Copy link
Contributor Author

That’s because Math.ceil(1/3)=1, and the edit distance between 0 and X is 1. The problem with tweaking the heuristic to make the number come out 0 is that the results aren’t really better: I would like X to match XY, but that one also has an edit distance of 1.

I’d almost rather just exclude tuple indices from the search index. Is anyone really going to type Some::0 into the search engine?

@Manishearth
Copy link
Member

Yeah we should exclude them

@GuillaumeGomez
Copy link
Member

That’s because Math.ceil(1/3)=1, and the edit distance between 0 and X is 1. The problem with tweaking the heuristic to make the number come out 0 is that the results aren’t really better: I would like X to match XY, but that one also has an edit distance of 1.

Maybe we should use floor instead? Also, it should still have xy if you search for x since xy starts with x.

I’d almost rather just exclude tuple indices from the search index. Is anyone really going to type Some::0 into the search engine?

+1

@jsha
Copy link
Contributor

jsha commented Oct 30, 2022

Oh, I thought it was something specific about the 0; but it's just the combination of:

  • maxLevenshtein for query of length 1 is 1, plus
  • 0 sorts ahead of everything else, plus
  • we cut off results past 100

So, yes, I agree we should just not index these; but also we should use floor. Otherwise, once all the 0s are gone, any other single-letter idents will match any single-letter queries.

@GuillaumeGomez
Copy link
Member

Seems like we all agree on removing the tuple elements from the index. I'll send a PR tomorrow.

@notriddle
Copy link
Contributor Author

Also, it should still have xy if you search for x since xy starts with x.

The problem with searching with maxLevDistance = 0 is that substring matches are not necessary treated as lev = 0. They just act as a modifier for levenshtein distance.

if (searchWord.indexOf(elem.pathLast) > -1 ||
row.normalizedName.indexOf(elem.pathLast) > -1
) {
index = row.normalizedName.indexOf(elem.pathLast);
}
lev = levenshtein(searchWord, elem.pathLast);
if (lev > 0 && elem.pathLast.length > 2 && searchWord.indexOf(elem.pathLast) > -1) {
if (elem.pathLast.length < 6) {
lev = 1;
} else {
lev = 0;
}
}
lev += lev_add;
if (lev > maxLevDistance) {
return;
} else if (index !== -1 && elem.fullPath.length < 2) {
lev -= 1;
}
if (lev < 0) {
lev = 0;
}

@GuillaumeGomez
Copy link
Member

Maybe we should change this behaviour? Something like:

  • If the item name starts with X, then give it a lev distance of 2.
  • If the item name contains X, then give it a lev distance of 3.

I'm curious of what the results would look like.

@jsha
Copy link
Contributor

jsha commented Oct 31, 2022

I think it's a mistake to use levenshtein distance also the ranking metric and apply boosts and demerits to it. The existing code already does this to a large extent and it makes things very confusing and hard to maintain. It would be better to more explicitly separate the matching and ranking functionality.

@GuillaumeGomez
Copy link
Member

That would make things simpler. I'm also curious to see what the results will look like.

JohnTitor pushed a commit to JohnTitor/rust that referenced this pull request Nov 1, 2022
…struct-field, r=notriddle

Remove generation of tuple struct fields in the search index

This comes from [this discussion](rust-lang#103710) as they're not very useful.

r? `@notriddle`
@bors
Copy link
Contributor

bors commented Nov 1, 2022

☔ The latest upstream changes (presumably #103829) made this pull request unmergeable. Please resolve the merge conflicts.

@notriddle notriddle closed this Dec 16, 2022
notriddle added a commit to notriddle/rust that referenced this pull request Jan 14, 2023
Since the sorting function accounts for an `index` field, there's not much
reason to also be applying changes to the levenshtein distance. Instead,
we can just not treat `lev` as a filter if there's already a non-sentinel
value for `index`.

This change gives slightly more weight to the index and path part, as
search criteria, than it used to. This changes some of the test cases,
but not in any obviously-"worse" way, and, in particular, substring matches
are a bigger deal than levenshtein distances (we're assuming that a typo
is less likely than someone just not typing the entire name).

Based on
rust-lang#103710 (comment)
matthiaskrgr added a commit to matthiaskrgr/rust that referenced this pull request Jan 18, 2023
…-stop-doing-demerits, r=GuillaumeGomez

rustdoc: simplify JS search routine by not messing with lev distance

Since the sorting function accounts for an `index` field, there's not much reason to also be applying changes to the levenshtein distance. Instead, we can just not treat `lev` as a filter if there's already a non-sentinel value for `index`.

<details>

This change gives slightly more weight to the index and path part, as search criteria, than it used to. This changes some of the test cases, but not in any obviously-"worse" way, and, in particular, substring matches are a bigger deal than levenshtein distances (we're assuming that a typo is less likely than someone just not typing the entire name).

The biggest change is the addition of a `path_lev` field to result items. It's always zero if the search query has no parent path part and for type queries, making the check in the `sortResults` function a no-op. When it's present, it is used to implement different precedence for the parent path and the tail.

Consider the query `hashset::insert`, a test case [that already exists and can be found here](https://github.com/rust-lang/rust/blob/5c6a1681a9a7b815febdd9de2f840da338984e68/src/test/rustdoc-js-std/path-ordering.js). We want the ordering shown in the test case:

```
        { 'path': 'std::collections::hash_set::HashSet', 'name': 'insert' },
        { 'path': 'std::collections::hash_set::HashSet', 'name': 'get_or_insert' },
        { 'path': 'std::collections::hash_set::HashSet', 'name': 'get_or_insert_with' },
        { 'path': 'std::collections::hash_set::HashSet', 'name': 'get_or_insert_owned' },
        { 'path': 'std::collections::hash_map::HashMap', 'name': 'insert' },
```

We do not want this ordering, which is the ordering that would occur if substring position took priority over `path_lev`:

```
        { 'path': 'std::collections::hash_set::HashSet', 'name': 'insert' },
        { 'path': 'std::collections::hash_map::HashMap', 'name': 'insert' }, // BAD
        { 'path': 'std::collections::hash_set::HashSet', 'name': 'get_or_insert' },
        { 'path': 'std::collections::hash_set::HashSet', 'name': 'get_or_insert_with' },
        { 'path': 'std::collections::hash_set::HashSet', 'name': 'get_or_insert_owned' },
```

We also do not want `HashSet::iter` to appear before `HashMap::insert`, which is what would happen if `path_lev` took priority over the appearance of any substring match. This is why the `sortResults` function has `path_lev` sandwiched between a `index < 0` check and a `index` comparison check:

```
        { 'path': 'std::collections::hash_set::HashSet', 'name': 'insert' },
        { 'path': 'std::collections::hash_set::HashSet', 'name': 'get_or_insert' },
        { 'path': 'std::collections::hash_set::HashSet', 'name': 'get_or_insert_with' },
        { 'path': 'std::collections::hash_set::HashSet', 'name': 'get_or_insert_owned' },
        { 'path': 'std::collections::hash_set::HashSet', 'name': 'iter' }, // BAD
        { 'path': 'std::collections::hash_map::HashMap', 'name': 'insert' },
```

The old code implemented a similar feature by manipulating the `lev` member based on whether a substring match was found and averaging in the path distance (`item.lev = name_lev + path_lev / 10`), so the path lev wound up acting like a tie breaker, but it gives slightly different results for `Vec::new`, [changing the test case](https://github.com/rust-lang/rust/pull/105796/files#diff-b346e2ef72a407915f438063c8c2c04f7a621df98923d441b41c0312211a5b21) because of the slight changes to ordering priority.

</details>

Based on rust-lang#103710 (comment)

Previews:

* https://notriddle.com/notriddle-rustdoc-demos/rustdoc-search-stop-doing-demerits/std/index.html
* https://notriddle.com/notriddle-rustdoc-demos/rustdoc-search-stop-doing-demerits-compiler/index.html
compiler-errors added a commit to compiler-errors/rust that referenced this pull request Jan 18, 2023
…-stop-doing-demerits, r=GuillaumeGomez

rustdoc: simplify JS search routine by not messing with lev distance

Since the sorting function accounts for an `index` field, there's not much reason to also be applying changes to the levenshtein distance. Instead, we can just not treat `lev` as a filter if there's already a non-sentinel value for `index`.

<details>

This change gives slightly more weight to the index and path part, as search criteria, than it used to. This changes some of the test cases, but not in any obviously-"worse" way, and, in particular, substring matches are a bigger deal than levenshtein distances (we're assuming that a typo is less likely than someone just not typing the entire name).

The biggest change is the addition of a `path_lev` field to result items. It's always zero if the search query has no parent path part and for type queries, making the check in the `sortResults` function a no-op. When it's present, it is used to implement different precedence for the parent path and the tail.

Consider the query `hashset::insert`, a test case [that already exists and can be found here](https://github.com/rust-lang/rust/blob/5c6a1681a9a7b815febdd9de2f840da338984e68/src/test/rustdoc-js-std/path-ordering.js). We want the ordering shown in the test case:

```
        { 'path': 'std::collections::hash_set::HashSet', 'name': 'insert' },
        { 'path': 'std::collections::hash_set::HashSet', 'name': 'get_or_insert' },
        { 'path': 'std::collections::hash_set::HashSet', 'name': 'get_or_insert_with' },
        { 'path': 'std::collections::hash_set::HashSet', 'name': 'get_or_insert_owned' },
        { 'path': 'std::collections::hash_map::HashMap', 'name': 'insert' },
```

We do not want this ordering, which is the ordering that would occur if substring position took priority over `path_lev`:

```
        { 'path': 'std::collections::hash_set::HashSet', 'name': 'insert' },
        { 'path': 'std::collections::hash_map::HashMap', 'name': 'insert' }, // BAD
        { 'path': 'std::collections::hash_set::HashSet', 'name': 'get_or_insert' },
        { 'path': 'std::collections::hash_set::HashSet', 'name': 'get_or_insert_with' },
        { 'path': 'std::collections::hash_set::HashSet', 'name': 'get_or_insert_owned' },
```

We also do not want `HashSet::iter` to appear before `HashMap::insert`, which is what would happen if `path_lev` took priority over the appearance of any substring match. This is why the `sortResults` function has `path_lev` sandwiched between a `index < 0` check and a `index` comparison check:

```
        { 'path': 'std::collections::hash_set::HashSet', 'name': 'insert' },
        { 'path': 'std::collections::hash_set::HashSet', 'name': 'get_or_insert' },
        { 'path': 'std::collections::hash_set::HashSet', 'name': 'get_or_insert_with' },
        { 'path': 'std::collections::hash_set::HashSet', 'name': 'get_or_insert_owned' },
        { 'path': 'std::collections::hash_set::HashSet', 'name': 'iter' }, // BAD
        { 'path': 'std::collections::hash_map::HashMap', 'name': 'insert' },
```

The old code implemented a similar feature by manipulating the `lev` member based on whether a substring match was found and averaging in the path distance (`item.lev = name_lev + path_lev / 10`), so the path lev wound up acting like a tie breaker, but it gives slightly different results for `Vec::new`, [changing the test case](https://github.com/rust-lang/rust/pull/105796/files#diff-b346e2ef72a407915f438063c8c2c04f7a621df98923d441b41c0312211a5b21) because of the slight changes to ordering priority.

</details>

Based on rust-lang#103710 (comment)

Previews:

* https://notriddle.com/notriddle-rustdoc-demos/rustdoc-search-stop-doing-demerits/std/index.html
* https://notriddle.com/notriddle-rustdoc-demos/rustdoc-search-stop-doing-demerits-compiler/index.html
bors added a commit to rust-lang-ci/rust that referenced this pull request Feb 6, 2023
…-2023, r=GuillaumeGomez

rustdoc: compute maximum Levenshtein distance based on the query

Preview: https://notriddle.com/notriddle-rustdoc-demos/search-lev-distance-2023/std/index.html?search=regex

The heuristic is pretty close to the name resolver, maxLevDistance = `Math.floor(queryLen / 3)`.

Fixes rust-lang#103357
Fixes rust-lang#82131

Similar to rust-lang#103710, but following the suggestion in rust-lang#103710 (comment) to use `floor` instead of `ceil`, and unblocked now that rust-lang#105796 made it so that setting the max lev distance to `0` doesn't cause substring matches to be removed.
bors added a commit to rust-lang/miri that referenced this pull request Feb 7, 2023
…GuillaumeGomez

rustdoc: compute maximum Levenshtein distance based on the query

Preview: https://notriddle.com/notriddle-rustdoc-demos/search-lev-distance-2023/std/index.html?search=regex

The heuristic is pretty close to the name resolver, maxLevDistance = `Math.floor(queryLen / 3)`.

Fixes #103357
Fixes #82131

Similar to rust-lang/rust#103710, but following the suggestion in rust-lang/rust#103710 (comment) to use `floor` instead of `ceil`, and unblocked now that rust-lang/rust#105796 made it so that setting the max lev distance to `0` doesn't cause substring matches to be removed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-rustdoc Relevant to the rustdoc team, which will review and decide on the PR/issue.
Projects
None yet
7 participants