Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

simplify SourceID Hash #14800

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

simplify SourceID Hash #14800

wants to merge 1 commit into from

Conversation

Eh2406
Copy link
Contributor

@Eh2406 Eh2406 commented Nov 8, 2024

What does this PR try to resolve?

Despite being interned SourceId::Eq is not a ptr::eq. Which in turn is because SourceIds concept of identity is a complete mess. The code relies on having to IDs that are Eq but do not have the same values for their fields.

As one measure of this SourceId has an impl Hash which does something different from fn full_hash and fn stable_hash. Separately SourceIdInner has a different implementation. Similar levels of complexity exist for Eq. Every one of these impls was added due to a real bug/issue we've had that needs to stay fixed. Not all of witch are reproducible enough to have made it into our test suite.

I have some ideas for how to reorganize the types so that this is easier to reason about and faster. But given the history and the complexity I want to move extremely carefully.

How should we test and review this PR?

The test pass, and it's a one line change, but this still needs careful review.

Additional information

r? @ehuss I remember you and Alex working very hard to track down most of these bugs.

@rustbot rustbot added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Nov 8, 2024
@Eh2406
Copy link
Contributor Author

Eh2406 commented Nov 8, 2024

If this turns out not to be a valid transformation I am happy to update the PR to be tests to prove why it's not okay.

Comment on lines -676 to 675
_ => self.inner.url.as_str().hash(into),
}
self.inner.canonical_url.hash(into);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most CanonicalUrl differences are related to git somehow. I'm assuming that means we can't hit them in any other way. There is the stripping of a trailing /. However, we probably should have been respecting that in the hash in the first place I assume. The question is if this will change anything significant enough

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps if you have a index hosted on git, then GITHUB.com/me/my-index.git/ could change a lot. I think the biggest concern here is having to recheck out dependencies. So even if this does affect people we might just eat that cost. On the other hand there's a lot of complexity here and maybe there are bigger problems I forgot about.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't yet looked closer. I wonder if this would affect hacks like this one: #5478 (comment).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that would be affected by this change. But it is something we have to watch for as I "clean up" this code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is true, since [patch.<url>] uses URLs directly, not SourceId.

This change may affect anywhere we let people define URL and parse it as SourceId, like the [registries] table? We could also deem this as a bug and fix it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would hope that we rely on Eq for those cases and not Hash. So that the problem you identified is not a problem with this PR, rather one to investigate on the next PR that changes Eq.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough. Have checked other usages of it, none of them seems user-facing like the path to index files. It should be fine to move over.

Copy link
Member

@weihanglo weihanglo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the paths to crates.io Git and spare index on users disks are not affected by this change, you have my vote.

@ehuss
Copy link
Contributor

ehuss commented Nov 12, 2024

To be clear, the goal here is to simplify the code a little bit? Are there any other intended improvements?

And the downside is that if someone has a custom registry that canonicalizes differently (like a trailing slash, or upper-case characters in a GitHub URL), it will hash differently and they will end up with new directories in ~/.cargo/registry/*/..., and duplicated cache contents?

If that's the tradeoff, I'm a little on the fence and could go either way.

If I'm understanding it correctly, would it be possible to update the test_stable_hash test with an example that illustrates the hash changing with this PR?

@Eh2406
Copy link
Contributor Author

Eh2406 commented Nov 12, 2024

To be clear, the goal here is to simplify the code a little bit? Are there any other intended improvements?

From this PR on its own, no other improvements. On its own, this is a small simplification that does not pull its weight. I have spiked out a larger re-factor that passes tests and that IMO does make things clearer and faster. It relies on several small changes like this that really need to be discussed thoroughly. So I wanted to pull out each of them in turn to make sure we found all of the unintended consequences.

And the downside is that if someone has a custom registry that canonicalizes differently (like a trailing slash, or upper-case characters in a GitHub URL), it will hash differently and they will end up with new directories in ~/.cargo/registry/*/..., and duplicated cache contents?

That is the only concern I'm aware of. But it is possible that I missed something.

If that's the tradeoff, I'm a little on the fence and could go either way.

Would it be helpful for me to post the full current spike as a draft PR to share where this is going?

If I'm understanding it correctly, would it be possible to update the test_stable_hash test with an example that illustrates the hash changing with this PR?

We can definitely add a unit test for this changing behavior. I should probably start by making sure it is in fact user facing with an end-to-end test. If it is, then yes a unit test is good idea. I will try and do that.

@Eh2406
Copy link
Contributor Author

Eh2406 commented Nov 14, 2024

And the downside is that if someone has a custom registry that canonicalizes differently (like a trailing slash, or upper-case characters in a GitHub URL), it will hash differently and they will end up with new directories in ~/.cargo/registry/*/..., and duplicated cache contents?

That is the only concern I'm aware of. But it is possible that I missed something.

While trying to update the tests as requested I thought of another concern. If the user has two custom registries that canonical eyes to the same thing, before this PR they would end up in separate directories in ~/.cargo/registry/*/... and now they would end up in the same folder. With this collision cause problems?

@weihanglo
Copy link
Member

, before this PR they would end up in separate directories in ~/.cargo/registry/*/... and now they would end up in the same folder. With this collision cause problems?

Theoretically yes. However I doubt somebody relies on this behavior to distinguish custom registries. It either needs to have trailing / or .git. or on gitub.com with protocol scheme other than https. We could notify interest groups in advance. Not sure what is the correct channel, t-cargo/build-integration doesn't seem to be the best fit.

@bors
Copy link
Contributor

bors commented Nov 15, 2024

☔ The latest upstream changes (presumably a8f7db2) made this pull request unmergeable. Please resolve the merge conflicts.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-review Status: Awaiting review from the assignee but also interested parties.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants