HashSet.GetOrAdd(T) #21603
Similar to https://github.com/dotnet/corefx/issues/11291 but not exactly a duplicate, since I don't want a generic parameter for the key or any other distinctions between keys and values. |
You're not; if you're on .NET Core 2.0, you can use `HashSet<T>.TryGetValue`. On one hand, the addition of |
Sweet! That was the thread I couldn't find! So implementing GetOrAdd should be a piece of cake. @TylerBrinkley? :D |
@jnm2 there is a code size cost to adding new APIs to commonly instantiated generic types, which means there is a higher bar for adding them. The consensus we reached is here: https://github.com/dotnet/corefx/issues/17275#issuecomment-291636863 |
The extension method still has a downside, which is the duplicate hashtable lookup. In that sense, this is a fundamental operation. Too bad it can't be the other way around, with other instance methods like Contains implemented as extension methods in terms of the fundamental ones. |
I think this one falls under the first criterion @terrajobst set for #17275, which is
While there is a performance hit of two lookups, I don't think this method would be used all that much. |
If we're going to do this, can we also have a |
@danmosemsft #20073 doesn't provide any way to do `GetOrAdd`. As an extension method it would look something like this:

```csharp
public static T GetOrAdd<T>(this HashSet<T> hashSet, T equalValue)
{
    if (hashSet.TryGetValue(equalValue, out var actualValue))
    {
        return actualValue;
    }

    hashSet.Add(equalValue);
    return equalValue;
}
```

@TylerBrinkley makes the point above that the performance hit of these duplicate lookups may not be important enough to add the instance method (requiring a bit more JIT time per generic instantiation). There is indeed precedent for preferring extension methods at the cost of duplicate hash lookups:

However, see the opposite point being made by @jamesqo in the Dictionary.GetOrAdd issue: https://github.com/dotnet/corefx/issues/2877#issuecomment-319244808 |
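For illustration (not part of the original thread), a minimal usage sketch of the `GetOrAdd` extension method above, assuming it is in scope in a static class; it canonicalizes strings under a case-insensitive comparer so that equal lookups return the instance that was stored first:

```csharp
using System;
using System.Collections.Generic;

class GetOrAddDemo
{
    static void Main()
    {
        // Every lookup that is "equal" under the comparer returns the first
        // instance that was added for that key.
        var canonical = new HashSet<string>(StringComparer.OrdinalIgnoreCase);

        string a = canonical.GetOrAdd("Hello"); // not present yet -> added, returns "Hello"
        string b = canonical.GetOrAdd("HELLO"); // equal under the comparer -> returns the stored "Hello"

        Console.WriteLine(object.ReferenceEquals(a, b)); // True
    }
}
```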
Ah yes, I forgot we made an extension method. Given that decision, it sounds like this issue can be closed? |
@danmosemsft #15059 is about Dictionary.GetOrAdd; this one is about HashSet.GetOrAdd. As far as I'm aware there is no extension method for GetOrAdd for either one. |
@danmosemsft I'm a bit confused here. Does this really have such a big impact on code size? I think I'd be more worried personally about the performance impact of the second lookup than a tiny bit more code for each generic instantiation, but I guess that's just me (and I guess it really depends on whether people will actually use this instance method on hot code paths). Am I understanding properly that the impact would be an additional JITed function and an additional vtable entry (one each per generic instantiation, not per instance of objects), and the cost to perform the JIT on first use of said function? |
@jnm2 my bad, there are of course two asks here. One was "a custom equality comparison being done but I need access to the original instance.", which now seems satisfied by `TryGetValue`. As for the cost of adding to the type itself (in this case, necessary to avoid dual lookups), it is not the JIT time but the code duplication. @jkotas can you remind me of the kind of size increase you were quoting for a reasonably sized method addition to Dictionary? |
I have been looking at the code size vs. benefit. I do not think I have said a universal number on how much it is ok to add. |
I meant, if I add a new 10-20 line method to Dictionary, as a rule of thumb how much does that increase the size of a typical app. I believe you gave a number but I cannot find it. |
I feel like we should close this. Given we're going down the CollectionsMarshal route for #15059, we should probably consider a similar approach for HashSet. cc @GrabYourPitchforks |
For the record, using CollectionsMarshal for HashSet was proven not to be viable because you can't do ref returns on keys: That being said, the consensus is that a |
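For context, this is the shape of the existing Dictionary-side `CollectionsMarshal` API being alluded to; it hands out a ref to the *value* slot, whereas a HashSet analogue would have to hand out a ref to the stored item (i.e. the key) itself, which is what the comment above says is not viable. A small sketch of the existing API only:

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;

class CollectionsMarshalDemo
{
    static void Main()
    {
        // Existing .NET 6+ API: one hash lookup, then direct ref access to the value slot.
        var counts = new Dictionary<string, int>();
        ref int count = ref CollectionsMarshal.GetValueRefOrAddDefault(counts, "key", out bool existed);
        count++; // writes straight into the dictionary's entry

        Console.WriteLine($"existed: {existed}, count: {counts["key"]}"); // existed: False, count: 1
    }
}
```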
I think this is a very narrow view of the value of HashSet. I've personally used HashSet many times where you want to "deduplicate" a bunch of reference objects that are "identical" in semantic meaning but occupy different memory. This is precisely the use case where GetOrAdd would be valuable and would greatly improve the performance. That said, I'm not sure I understand why TryGetValue doesn't do what is needed. |
I don't intend to; this is a discussion about adding a new API, not removing an old API 🙂
I never claimed that types have a unique notion of equality; what I did say is that, under a given definition of equality, two equal values should be considered equivalent.
Same here, but the mere act of deduplication implies that you're happy to discard references up to that notion of equality, which suggests that these references are equivalent in your domain.
How would a |
@eiriktsarpalis since you are arguing against the proposed API, I think that the same functionality could be offered as an extension method on top of a `CollectionsMarshal.GetItemRefOrAdd` API:

```csharp
public static T GetOrAdd<T>(this HashSet<T> source, T item)
{
    ref T itemRef = ref CollectionsMarshal.GetItemRefOrAdd(source, item, out bool exists);
    // In case the item is not found, it has been added to the set and assigned to the itemRef.
    return itemRef;
}
```

Although the factory-based overload is trickier:

```csharp
public static T GetOrAdd<T, TArg>(this HashSet<T> source, T item, Func<TArg, T> itemFactory, TArg factoryArgument)
{
    ref T itemRef = ref CollectionsMarshal.GetItemRefOrAdd(source, item, out bool exists);
    if (!exists)
    {
        item = itemFactory(factoryArgument);
        Debug.Assert(source.Comparer.Equals(item, itemRef));
        itemRef = item;
    }
    return itemRef;
}
```

This one transfers the responsibility of correctness to the caller. It's up to the `itemFactory` to return a value that is equal, under the set's comparer, to the searched `item`. |
I cannot for the life of me understand why the Factory version of this GetOrAdd for HashSet would be useful. If you've already constructed the T, why do you need a factory to help you create ANOTHER one? I totally understand the value of GetOrAdd(item) though, as it serves the purpose I mentioned earlier. @eiriktsarpalis - the performance improvement for a GetOrAdd in the dedup situation is that you don't "search for it" then "add it if you didn't find it" in two steps since these are two hashes and searches. Similar to the argument for the same thing for Dictionary. |
@kellypleahy the purpose of the factory-based |
I think the difference here with |
Hmm... that one seems really bad to me, but I guess that's just my opinion. I'm 100% +1 on |
@TylerBrinkley in case you have any idea about the static footprint cost of adding a method in the |
@kellypleahy in case the |
I don't understand. There's no need to make multiple lookups when deduplicating with today's APIs:

```csharp
IEnumerable<T> Dedup<T>(IEnumerable<T> source)
{
    HashSet<T> set = new();
    foreach (T item in source)
    {
        if (set.Add(item)) yield return item;
    }
}
```
Sorry, I meant dedup when, for instance, deserializing. Not dedup for the purposes of "unique" in an array or result set. In other words, if you have an immutable data structure that you are deserializing that might have many "duplicates" of leaf nodes in it (this is very common for my applications), or you are making sure that you don't have 100s of copies of the same string stored in a graph, then you need two lookups (because you need the item from the set). |
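A minimal sketch of the deserialization-time canonicalization pattern being described, using only today's APIs (the `GraphDeserializer` type and `Canonicalize` helper are hypothetical names for this example); note the two hash lookups on the miss path, which is the cost a single-lookup `GetOrAdd` would avoid:

```csharp
using System;
using System.Collections.Generic;

public sealed class GraphDeserializer
{
    // Pool of strings already seen while deserializing, so the resulting graph
    // shares one instance per distinct value instead of hundreds of copies.
    private readonly HashSet<string> _canonicalStrings = new(StringComparer.Ordinal);

    private string Canonicalize(string value)
    {
        // First hash lookup: is an equal instance already in the set?
        if (_canonicalStrings.TryGetValue(value, out string existing))
        {
            return existing;
        }

        // Second hash lookup: hash the value again just to add it.
        _canonicalStrings.Add(value);
        return value;
    }
}
```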
The discussion around an existing shipped API, and what should be done about it, is entirely separate from the discussion around adding new functionality. If you want to discuss that, I would recommend a separate discussion. At best, it definitely will not help with this discussion. At worst, it will definitely derail it by adding confusion and muddying the topic. |
@CyrusNajmabadi I have no reason to start a discussion about the |
It doesn't. The two are separate discussions. As Erik rightly pointed out
The new API has to stand on its own merits. Erik has made reasonable claims both as to the goals and designs of this API and the very real and relevant perf concerns the runtime has here. The latter is extremely important given the size of the ecosystem here and how impacted it is by that. This means there is a very high bar to clear, despite how desirable you might find this. Furthermore, i'll go back to my previous point. If the perf concerns of the runtime are not acceptable to you as part of the decision making process, then the same applies in reverse. If we take perf out, then just use a dictionary, or just use an extension that makes multiple calls. If these are unpalatable to you because of the perf costs they will have, then please recognize the same concerns apply to the runtime for the much larger and much more perf sensitive ecosystem they are targeting.
Ok. Then that should be tabled. Its presence isn't part of the discussion here or the argument as to whether or not GetOrAdd should be added. |
@CyrusNajmabadi the
I never said that it's not acceptable. I said that it's depressing. I wish that the static footprint issue had been addressed with some clever technical solution, instead of being addressed by restricting the evolution of the APIs. |
As already stated: "it shouldn't be carte blanche for us to follow up with more mistakes." Many API authors have intimate experience with this. Your past isn't perfect, and there are things you wish you hadn't done. But you learn from that and use that to inform future decisions. That means you pay more attention and are more careful. The experience with mistakes motivates you to go away from that space, not more toward it. |
I'm not sure what can be done about that. However, it's part and parcel of every api decision made practically everywhere that there are constraints you have to work within. That's just life :) Maybe the constraints change in the future and you can revisit some decisions. |
And why exactly was the |
I'd consider myself harmed, due to the confusion such an API causes, and the bizarre sort of dual view where there can be different equality concepts. -- Yup, giving this more thought, i really think that API was def a mistake. Even the use case is much better served by using a Dictionary. It's clear, reasonable, and already about what Dictionaries are intended for (mapping from one domain to another). That the key and value domains overlap in terms of their types while having different equality concepts (IEquatable vs identity) means, to me, that they should be separated out and not collapsed into a single entity. |
Note: i'm not opposed to a back-door for perf. But i would keep it well outside of the mainline api, as this just feels very broken. |
@CyrusNajmabadi well, you can certainly substitute hashsets with dictionaries in most everyday scenarios. If you need a (I am half-joking here, because I think that you are half-joking as well.) |
I'm not joking at all. I think sets should be used only when everything is treated uniformly in the domain of the equality contract picked for the collection. If you need differing equality contracts, use a dictionary; it properly models that these are distinct spaces. As i mentioned already, i would not be opposed to a back-door api here if absolute perf is needed. But i would not make it part of the core surface contract of HashSet.
If i have such a set with such strange equality concepts, i would be very wary about using any of those apis. I would def want to spell out precisely what semantics i was expecting, and would likely prefer a dictionary here to keep things clear. |
@CyrusNajmabadi the concept of embedding a key inside the items stored in a collection is as old as the |
That's sometimes how things just fall out. Trust me, there are probably around 500 apis i'd love that would make my life materially better. But i'm not going to be able to adequately justify those apis happening in the core runtime apis themselves. That's life.
Yes. For context, i've worked on .Net since before it was 1.0. I'm very familiar with its history (and its massive warts). I don't look at things like that as any sort of good argument for continuing with such decisions in the future. As mentioned before, i want us to look to the past to see what we can learn from it, esp. about what not to do.
Sure. But, like with all api requests, just because you have a need/desire here, is only part of the equation. Every API request ever is driven by people trying to solve problems, where they've decided to use some API and have found it lacking. The goal of the runtime is to get in that feedback, and weigh all the factors involved, to make the best choices for the needs of the ecosystem as a whole. And, a lot of the time, that means that people wanting those api changes will not get what they want. In that case you have to decide how impactful that is to you. If it's something you can live with, great. If not, that sometimes means you have to build your own solution. As someone who has built easily hundreds of APIs (some of which did end up rolling into the BCL, but most of which did not), i'm very familiar with this :) |
@kellypleahy fully deserializing an object only to later discard it on the basis of duplication sounds inefficient to me. The cost of a double lookup should be multiple orders of magnitude smaller than the cost of deserialization, so a |
You keep repeating that my argument is theoretical, but bugs related to inappropriate use of equality are very real and very common. In concrete terms, this pattern encourages definitions of equality that only make sense when scoped to one particular hash table. This creates problems because, generally speaking, we encourage types to have universal equality semantics (e.g. via the CA2231 and CA1067 analyzers). Assuming that equality uses projection semantics (e.g. like the

If there's a broader takeaway to be made here, it's that equality is hard and you should avoid implementing it manually unless there is a very compelling reason to do so. The language(s) and framework provide tools that let you (mostly) delegate that responsibility: I personally use tuples and records whenever I need to represent structural equality for complex data:

```csharp
items.DistinctBy(item => (item.Date, item.Amount));

public sealed class Item // reference equality
{
    public DateTime Date { get; }
    public int Amount { get; }
    public List<int> RelatedInfo { get; }
}
```

The BCL is an aggregate of over 20 years of shifting programming paradigms, patterns and, yes, in some cases mistakes. Some APIs have stood the test of time, others not so much. It does not follow that we should double down on a particular pattern just because an instance of it got shipped at some point in time. Nor does the fact that we don't aggressively remove/deprecate APIs deemed obsolete by one member of the BCL team betray some sort of hypocrisy or double standard, as you're suggesting: it's simply not worth doing from a cost-benefit perspective in most cases. |
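To make the tuples-and-records point concrete, here is a small self-contained sketch (the `Transaction` type is hypothetical, not from the comment above): a positional record gets compiler-generated value-based `Equals`/`GetHashCode`, so deduplication works without a tuple projection or a hand-written comparer.

```csharp
using System;
using System.Linq;

// Hypothetical record: equality is generated by the compiler and compares by value.
public sealed record Transaction(DateTime Date, int Amount);

public static class RecordDedupDemo
{
    public static void Main()
    {
        var items = new[]
        {
            new Transaction(new DateTime(2024, 1, 1), 100),
            new Transaction(new DateTime(2024, 1, 1), 100), // structurally equal duplicate
        };

        Console.WriteLine(items.Distinct().Count()); // 1
    }
}
```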
@eiriktsarpalis how many bugs do you believe have been caused by the introduction of the
I should mention that not only the |
I don't think anybody in this thread expressed interest in preventing users from using HashSet to their heart's content. The concern that has actually been expressed is that we shouldn't introduce new methods to popular generic types willy-nilly, particularly ones that are of questionable utility and can be worked around trivially, as you are pointing out. |
Eirik I interpreted this previous comment:
...as a desire to discourage developers from making creative use of the |
There are many times when there is a custom equality comparison being done but I need access to the original instance.
What I'd like:
Today I'm forced to use a Dictionary which duplicates storage of the key for no reason.
A dictionary is also wasteful because it has to do two hash lookups, one for `TryGetValue` and one for `Add`:
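The code block that followed here was lost; below is a reconstruction (a sketch, not the author's original snippet) of the `Dictionary<T, T>` workaround described above, with the duplicated storage and the two hash lookups called out:

```csharp
using System.Collections.Generic;

public static class DictionaryWorkaround
{
    public static T GetOrAdd<T>(Dictionary<T, T> map, T value)
    {
        // First hash lookup: is an equal instance already stored?
        if (map.TryGetValue(value, out T existing))
        {
            return existing;
        }

        // Second hash lookup, and the value is stored twice (as key and as value).
        map.Add(value, value);
        return value;
    }
}
```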
This is no better; it's a lot more to maintain, unnecessarily coupled to the definition of equality, it still duplicates storage of the key, and it still duplicates the hash lookups:
KeyedCollection is the worst of all because in addition to the dictionary, it keeps a list I'll never use.
Proposed API
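The proposal block itself was lost in extraction; based on the issue title (`HashSet.GetOrAdd(T)`) and the discussion above, the intended shape is presumably along these lines (a reconstruction, not the author's exact text):

```csharp
namespace System.Collections.Generic
{
    public partial class HashSet<T>
    {
        // Returns the element of the set that is equal to equalValue under the
        // set's comparer; otherwise adds equalValue and returns it.
        // Only a single hash lookup is required.
        public T GetOrAdd(T equalValue);
    }
}
```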