Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create a deterministic FxHashMap wrapper #63713

Closed
bjorn3 opened this issue Aug 19, 2019 · 48 comments
Closed

Create a deterministic FxHashMap wrapper #63713

bjorn3 opened this issue Aug 19, 2019 · 48 comments
Labels
A-reproducibility Area: Reproducible / deterministic builds C-cleanup Category: PRs that clean code up or issues documenting cleanup. C-enhancement Category: An issue proposing an enhancement or a PR with one. E-easy Call for participation: Easy difficulty. Experience needed to fix: Not much. Good first issue. P-medium Medium priority T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@bjorn3
Copy link
Member

bjorn3 commented Aug 19, 2019

It should not provide iteration support, but only insert/remove/get/get_mut and convertion to a sorted vec. We should then use it where possible

This could prevent accidentially causing hashmap related non determinism in most cases.

Originally posted by @bjorn3 in #58281 (comment)

@jonas-schievink jonas-schievink added C-enhancement Category: An issue proposing an enhancement or a PR with one. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Aug 19, 2019
@Centril Centril added E-easy Call for participation: Easy difficulty. Experience needed to fix: Not much. Good first issue. E-help-wanted Call for participation: Help is requested to fix this issue. C-cleanup Category: PRs that clean code up or issues documenting cleanup. labels Aug 19, 2019
@shivan1b
Copy link
Contributor

Hi @bjorn3 !
If this is not taken up by someone already, can I work on this? I'm a first time contributor to Rust and am browsing through the Easy marked issues.

@bjorn3
Copy link
Member Author

bjorn3 commented Aug 23, 2019

I dont think somebody else is working on this, so go for it.

This wrapper should be put in the rustc_data_structures crate.

@jonas-schievink jonas-schievink added the A-reproducibility Area: Reproducible / deterministic builds label Aug 23, 2019
@shivan1b
Copy link
Contributor

Hi @bjorn3 do you have name in mind for this wrapper? Also, is it OK if I take some time on this or is this an urgent requirement?

@bjorn3
Copy link
Member Author

bjorn3 commented Aug 26, 2019

Hi @bjorn3 do you have name in mind for this wrapper?

Not really.

Also, is it OK if I take some time on this or is this an urgent requirement?

You can probably take some time.

@petrochenkov
Copy link
Contributor

Semi-related, I'd also like to see an "accumulative" FxHashMap wrapper panicking on any attempts to change or remove existing values in the map.

Centril added a commit to Centril/rust that referenced this issue Sep 28, 2019
…=Mark-Simulacrum

data_structures: Add deterministic FxHashMap and FxHashSet wrappers

StableMap
A wrapper for FxHashMap that allows to `insert`, `remove`, `get`, `get_mut` and convert a hashmap into a sorted vector using the method `into_sorted_vector` but no iteration support.

StableSet
A wrapper for FxHashSet that allows to `insert`, `remove`, `get` and convert a hashset into a sorted vector using the method `into_sorted_vector` but no iteration support.

Addresses issue rust-lang#63713
@RalfJung
Copy link
Member

I was told once that FxHashMap does not have a seed in its hasher method and hence it actually is deterministic. Was that wrong information, or did things change since then?

@RalfJung
Copy link
Member

Cc @eddyb; I think it was them who told me that (but I might misremember)

@jonas-schievink
Copy link
Contributor

@RalfJung
Copy link
Member

But then, what is the purpose of the new StableMap? (Unfortunately the link to he original context links to the entire PR with >200 comments instead of the concrete comment triggering this, so I do not know the context of this.)

@bjorn3
Copy link
Member Author

bjorn3 commented Sep 29, 2019

(Unfortunately the link to he original context links to the entire PR with >200 comments instead of the concrete comment triggering this, so I do not know the context of this.)

#58281 (comment)

I was told once that FxHashMap does not have a seed in its hasher method and hence it actually is deterministic.

If an entry is added to a FxHashMap, can that affect the order of the entire map? For example when needing to resize?

@RalfJung
Copy link
Member

If an entry is added to a FxHashMap, can that affect the order of the entire map? For example when needing to resize?

Yes I think it can.
However, the order is fully deterministic: Given the same sequence of inserts and removes, the order of iteration will always be the same. If that is not sufficient for what you want, you are looking for a property that is stronger than determinism. (So, the issue title and OP text here are confusing.)

@bjorn3
Copy link
Member Author

bjorn3 commented Sep 29, 2019

Given the same sequence of inserts and removes, the order of iteration will always be the same.

+1

If that is not sufficient for what you want, you are looking for a property that is stronger than determinism.

Yeah, I guess so. It is just confusing and annoying that adding a single extra element to a FxHashMap causes the whole iteration order to change. It means that seemingly innocent changes can lead to big changes in the compilation result.

@RalfJung
Copy link
Member

RalfJung commented Sep 29, 2019

Yeah, I guess so. It is just confusing and annoying that adding a single extra element to a FxHashMap causes the whole iteration order to change. It means that seemingly innocent changes can lead to big changes in the compilation result.

Fair.

Even for the same domain (set of keys), if the sequence of inserts/removes is different, iteration order might differ.

@eddyb
Copy link
Member

eddyb commented Sep 29, 2019

It would probably still be nice to disallow misusing a hashmap without performance losses, regardless of what that would mean theoretically.
I would even just call it Map, but that might be confusing?

@RalfJung
Copy link
Member

RalfJung commented Oct 9, 2019

Sure. And StableMap is a reasonable name. But the documentation (and the title of this issue here) should be adjusted to explain that this is about small changes to the input resulting in a differently sorted output, and general unpredictable output sorting, but has nothing to do with determinism.

@moshg
Copy link

moshg commented Nov 6, 2019

How about using LinkedHashMap or IndexMap with FxHasher?
These will allow iteration in insertion order.

@Aaron1011
Copy link
Member

Aaron1011 commented Nov 6, 2019

Note that FxHashMap uses different hash algorithms on 32-bit vs 64-bit platforms, which can cause different iteration orders across platforms. I've hit this multiple times before, so if nothing else, it would be useful to have StableMap to help prevent this kind of issue.

@Aaron1011
Copy link
Member

See #47833 (comment) and #65043 for examples of issues caused by relying on the platform-dependent iteration order.

@cuviper
Copy link
Member

cuviper commented Sep 1, 2020

#64131 added StableMap and StableSet that don't let you access the "unstable" iterators, only into_sorted_vec(), but I don't see anything using these types yet. We also have FxIndexMap and FxIndexSet aliases for indexmap that keep insertion order, and these are used in quite a few places. Is there anything else needed?

@hameerabbasi
Copy link
Contributor

Since this issue has gone stale, I can attempt to pick it up. Is that okay, @ibraheemdev?

@ibraheemdev
Copy link
Member

@hameerabbasi I have some work done locally, but I think the better approach is to add an internal lint as @Mark-Simulacrum mentioned. If I haven't opened a PR in a week or feel free to pick this up :)

@Aaron1011
Copy link
Member

This came up in #91943

@hameerabbasi
Copy link
Contributor

Can someone post a summary of what needs to be done? Just change all instances of FxHashMap in the compiler with equivalent calls to StableMap instead?

Or is a lint the preferred way forward? I'll have time to work on this starting 23rd, but I realise it's the end of the year and no one may be available to mentor.

@RalfJung
Copy link
Member

Miri was also bitten by this, see rust-lang/miri#1937.

@hameerabbasi
Copy link
Contributor

@rustbot claim

@rustbot rustbot assigned hameerabbasi and unassigned ibraheemdev Dec 16, 2021
@pnkfelix pnkfelix added the I-prioritize Issue: Indicates that prioritization has been requested for this issue. label Feb 1, 2022
@michaelwoerister
Copy link
Member

@hameerabbasi, I'd like to see some progress with this. I can't spend a huge amount of time on it -- but if that's not a problem, I'd be happy to do the mentoring.

@hameerabbasi
Copy link
Contributor

@michaelwoerister If you could post a summary of what to do and what files to touch that'd be great!

@michaelwoerister
Copy link
Member

OK, I'll try to do that some time next week.

@michaelwoerister
Copy link
Member

michaelwoerister commented Feb 9, 2022

OK, so the main goal here is to implement a map and a set type that satisfies the HashStable requirement that any observable change to its value will also result in a change to its Fingerprint (and that two values that have observable differences are also different according to PartialEq/Eq).

Currently HashMap, HashSet, BTreeMap, and BTreeSet all implement HashStable -- however, this actually isn't sound because they expose an observable ordering of their elements but that order is ignored by both their Eq and their HashStable implementations. The consequence is that the query system assumes that a value has not changed (because it still has the same fingerprint provided by HashStable), even though re-running the query would indeed produce a different result. In the worst case this can lead to miscompilations (most of which should be caught by fingerprint verification at this point though).

StableMap and StableSet were introduced to remedy this but they don't quite fit the bill yet. They do provide an into_sorted_vec() method, but sorting relies on Ord, which means that the ordering might rely on unstable things like memory addresses or interning IDs. We need to fix that and then start to replace occurrences of regular maps and sets with their stable counterparts. It should be sufficient to only replace occurrences inside of other types that implement HashStable.

So, my overall plan of action would be:

  1. Start with HashSet which is used in fewer places and should serve as a good proof of concept case.
  2. Add a HashStable implementation for StableSet that looks like the one we have for HashSet at the moment.
  3. Change into_sorted_vec() to sort by the result of ToStableHashKey::to_stable_hash_key() instead of Ord. The Vec::sort_by_cached_key() method should come in handy here.
  4. Put an #[inline] attribute on all methods of StableSet in order to avoid unnecessary performance issues.
  5. Remove the HashStable implementation for the regular HashSet. This should result in compilation errors telling you all the occurrences of regular HashSets that need to be replaced with StableSet. Perform those replacements.

These steps can later be repeated for HashMap, BTreeMap, and BTreeSet too -- but I suggest to stick to HashSet first and see how that goes. I suspect there will be one or two cases where it's not entirely clear how to proceed. We can look at those together.

@RalfJung
Copy link
Member

RalfJung commented Feb 9, 2022

Shouldn't the BTree collections be fine, since their iteration order is given by PartialOrd, which needs to be compatible with Eq?

@hameerabbasi
Copy link
Contributor

hameerabbasi commented Feb 10, 2022

Shouldn't the BTree collections be fine, since their iteration order is given by PartialOrd, which needs to be compatible with Eq?

I think this part of the comment addresses why PartialOrd alone isn't sufficient (please correct me if I'm wrong, always happy to learn):

They do provide an into_sorted_vec() method, but sorting relies on Ord, which means that the ordering might rely on unstable things like memory addresses or interning IDs.

As an add-on: I believe I will have time for this this weekend. I'll see how far I can get.

@RalfJung
Copy link
Member

Hm, but that same concern could also apply to Eq, which might also rely on unstable things. Any into_sorted_vec will have that problem, even the one on FxHashMap, won't it? That seems rather different from the instability of hash map iteration order.

@hameerabbasi
Copy link
Contributor

If I understand the problem correctly, the correct solution seems to be excluding certain fields from comparison completely. Perhaps the derivative crate can help here.

@michaelwoerister
Copy link
Member

Shouldn't the BTree collections be fine, since their iteration order is given by PartialOrd, which needs to be compatible with Eq?

Normally, yes. But in the context of incremental compilation, we'd like to have a stronger form of stability, spanning across multiple compilation sessions. For things that implement HashStable (which acts as a marker for "stable across compilation sessions"), we actually also want Eq to take any observable properties into account (as opposed to just being in sync with PartialOrd). There has been discussion around introducing a new trait HashStableEq for that or adding a lint that enforces the stricter requirements on Eq if the type also implements HashStable here but that's still ongoing.

Overall the situation there is quite unsatisfactory, especially around "pointer-like" things like DefId. HashStable currently treats them as reference without an observable address (i.e. it only hashes their contents) but in reality, some parts of the compiler make decisions based on the numeric value of their "address" (i.e. the numeric value of the CrateNum and DefIndex in there).

It's fair to say that there might be a place for both: a Map/Set type that just takes care of regular iteration order determinism (like BTreeMap) and one that is also suited for having a proper HashStable implementation. But the latter should be able to cover most use cases for the former without much of a downside, I hope.

@hameerabbasi
Copy link
Contributor

I've begun work on this in #94188.

@apiraino apiraino added P-medium Medium priority and removed I-prioritize Issue: Indicates that prioritization has been requested for this issue. labels May 19, 2022
@apiraino
Copy link
Contributor

Mirroring the comment on the sibling issue.

I see some work happened in #102698, how is it relevant to this issue. Can this be closed?

@michaelwoerister
Copy link
Member

In my opinion, this particular issue can be closed. The deterministic wrappers exist in the form of UnordMap and UnordSet. They haven't been deployed everywhere yet -- but widely enough to say with some confidence that they do the job (together with FxIndexMap and FxIndexSet and the indeterminism lint).

The other issue being linked needs to be checked separately though. In #106977 and #108312 I've found that most uses of FxHashMap can be replaced pretty easily with UnordMap (or FxIndexMap if that is a better fit).

@apiraino
Copy link
Contributor

apiraino commented Apr 19, 2023

@michaelwoerister thanks for the feedback! Will close this one, please let me know if there are opposing opinions

@apiraino apiraino removed the E-help-wanted Call for participation: Help is requested to fix this issue. label Apr 19, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-reproducibility Area: Reproducible / deterministic builds C-cleanup Category: PRs that clean code up or issues documenting cleanup. C-enhancement Category: An issue proposing an enhancement or a PR with one. E-easy Call for participation: Easy difficulty. Experience needed to fix: Not much. Good first issue. P-medium Medium priority T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests