HashSet.GetOrAdd(T) #21603

jnm2 · 2017-05-09T23:08:14Z

There are many times when there is a custom equality comparison being done but I need access to the original instance.

What I'd like:

var mergedItems = new HashSet<Item>();

foreach (var item in items)
{
    mergedItems.GetOrAdd(item).RelatedInfo.Add(x);
}

public sealed class Item : IEquatable<Item>
{
    public DateTime Date { get; }
    public int Amount { get; }
    public List<int> RelatedInfo { get; }

    public bool Equals(Item other) => other != null && other.Date == Date && other.Amount == Amount;

    public override bool Equals(object obj) => obj is Item item && Equals(item);

    public override int GetHashCode() => // ...
}

Today I'm forced to use a Dictionary which duplicates storage of the key for no reason.
A dictionary is also wasteful because it has to do two hash lookups, one for TryGetValue and one for Add:

var mergedItems = new Dictionary<Item, Item>();

foreach (var item in items)
{
    if (!mergedItems.TryGetValue(item, out var otherItem))
        mergedItems.Add(item, otherItem = item);
    otherItem.RelatedInfo.Add(x);
}

This is no better; it's a lot more to maintain, unnecessarily coupled to the definition of equality, it's still duplicate storage of the key, and it's still duplicating the number of hash lookups:

var mergedItems = new Dictionary<(DateTime, int), Item>();

foreach (var item in items)
{
    if (!mergedItems.TryGetValue((item.Date, item.Amount), out var otherItem))
        mergedItems.Add((item.Date, item.Amount), otherItem = item);
    otherItem.RelatedInfo.Add(x);
}

KeyedCollection is the worst of all because in addition to the dictionary, it keeps a list I'll never use.

Proposed API

namespace System.Collections.Generic;

public class HashSet<T>
{
+   /// <summary>
+   /// Searches the set for a given element and returns the equal element it finds, if any;
+   /// otherwise, adds the element to the set.
+   /// </summary>
+   /// <param name="item">The element to search for, or to add to the set.</param>
+   /// <returns>
+   /// The equal element that the search found, or the given element that was added to the set.
+   /// </returns>
+   public T GetOrAdd(T item);
}

The text was updated successfully, but these errors were encountered:

jnm2 · 2017-05-09T23:11:25Z

Similar to https://github.com/dotnet/corefx/issues/11291 but not exactly a duplicate, since I don't want a generic parameter for the key or any other distinctions between keys and values.

svick · 2017-05-10T00:38:27Z

Today I'm forced to use a Dictionary which duplicates storage of the key for no reason.

You're not, if you're on .Net Core 2.0, you can use HashSet<T>.TryGetValue() (https://github.com/dotnet/corefx/issues/15691), though it still requires two lookups.

On one hand, the addition of TryGetValue indicates that retrieving a value from a HashSet based on a equal, but distinct value makes sense. On the other hand, it means adding GetOrAdd is less important: it improves performance and convenience, but doesn't add new capabilities.

jnm2 · 2017-05-10T00:49:38Z

Sweet! That was the thread I couldn't find! So implementing GetOrAdd should be a piece of cake. @TylerBrinkley? :D

danmoseley · 2017-05-10T04:15:57Z

@jnm2 there is a code size cost to adding new API to commonly instantiated generic types which means there is a higher bar for adding them. Consensus we achieved is here https://github.com/dotnet/corefx/issues/17275#issuecomment-291636863
Given this one could be achieved by an extension method, that might be the most palatable approach.,

jnm2 · 2017-05-10T04:46:02Z

The extension method still has a downside, which is the duplicate hashtable lookup. In that sense, this is a fundamental operation. Too bad it can't be the other way around, other instance methods like Contains as extension methods implemented in terms of the fundamentals.

TylerBrinkley · 2017-05-10T05:04:51Z

I think this one falls under the first criteria @terrajobst set for #17275 which is

CONSIDER making rare methods on generic types extension methods to avoid forcing the code gen to instantiate all these methods

While there is a performance hit of 2 lookups I don't think this method would be used all that much.

kellypleahy · 2017-06-08T05:55:21Z

If we're going to do this, can we also have a GetOrAdd that takes the factory function? This is something that seems also quite valuable? Also, can we put these on Dictionary<TKey,TValue> (as extension methods) if they don't already exist too? Is there already a ticket for Dictionary<TKey,TValue>?

danmoseley · 2018-08-19T16:32:44Z

@jnm2 can you clarify how this is not satisfied by #20073 ? I'm missing it.

jnm2 · 2018-08-19T16:55:48Z

@danmosemsft #20073 doesn't provide any way to do GetOrAdd without a duplicate hash lookup, that I can see. It enables us to do this, but in the 'add' scenario, there is duplicate work being done in the calls to TryGetValue and Add:

public static T GetOrAdd<T>(this HashSet<T> hashSet, T equalValue)
{
    if (hashSet.TryGetValue(equalValue, out var actualValue))
    {
        return actualValue;
    }

    hashSet.Add(equalValue);
    return equalValue;
}

@TylerBrinkley makes the point above that the performance hit of these duplicate lookups may not be important enough to add the instance method (requiring a bit more JIT time per generic instantiation).

There is indeed precedent for preferring extension methods at the cost of duplicate hash lookups:

https://github.com/dotnet/corefx/blob/b9ce86fbec92349c203a4c33805cffb8ca24c10f/src/System.Collections/src/System/Collections/Generic/CollectionExtensions.cs#L25-L39

However, see the opposite point being made by @jamesqo in the Dictionary.GetOrAdd issue: https://github.com/dotnet/corefx/issues/2877#issuecomment-319244808

danmoseley · 2018-08-19T17:02:42Z

Ah yes i forgot we made an extension method. Given that decision it sounds like this issue can be closed?
The cost of growing commonly used generic types is the code size. Dictionary is one of the most sensitive, a small change can grow hello world size significantly.

jnm2 · 2018-08-19T17:05:10Z

@danmosemsft #15059 is about Dictionary.GetOrAdd; this one is about HashSet.GetOrAdd. As far as I'm aware there is no extension method for GetOrAdd for either one.

kellypleahy · 2018-08-19T20:39:00Z

@danmosemsft I'm a bit confused here. Does this really have such a big impact on code size? I think I'd be more worried personally about the performance impact of the second lookup than a tiny bit more code for each generic instantiation, but I guess that's just me (and I guess it really depends on whether people will actually use this instance method on hot code paths). Am I understanding properly that the impact would be an additional JITed function and an additional vtable entry (one each per generic instantiation, not per instance of objects), and the cost to perform the JIT on first use of said function?

danmoseley · 2018-08-20T17:05:36Z

@jnm2 my bad, there's of course two asks here, one was "a custom equality comparison being done but I need access to the original instance." which seems now satisfied by HashSet<T>.TryGetValue (as already observed above) and the other is to be able to "GetOrAdd", which as you say is not currently satisfied, although on Dictionary<K,V> there is TryAdd it doesn't return the value.

As for the cost of adding to the type itself (in this case, necessary to avoid dual lookups) it is not the JIT time but the code duplication. Dictionary<K,V> is instantiated many times in a typical app - once for reference types, then once for (I guess - not my area of expertise) each sized value type, and that can add up significantly. So several proposed additions to the type were either abandoned or moved to extension methods (which is unfortunate if like in this case it would be less efficient). I would expect this to be a lesser concern for HashSet since it's presumably instantiated for fewer types.

@jkotas can you remind me of the kind of size increase you were quoting for a reasonably sized method addition to Dictionary?

jkotas · 2018-08-20T17:40:01Z

can you remind me of the kind of size increase you were quoting for a reasonably sized method addition to Dictionary?

I have been looking at the code size vs. benefit. I do not think I have said a universal number on how much it is ok to add.

danmoseley · 2018-08-20T17:52:24Z

I meant, if I add a new 10-20 line method to Dictionary, as a rule of thumb how much does that increase the size of a typical app. I believe you gave a number but I cannot find it.

eiriktsarpalis · 2020-11-10T15:57:50Z

I feel like we should close this. Given we're going down the CollectionsMarshal route for #15059, we should probably consider a similar approach for HashSet. cc @GrabYourPitchforks

eiriktsarpalis · 2023-03-02T19:23:44Z

For the record, using CollectionsMarshal for HashSet was proven to be not viable because you can't do ref returns on keys:

#82840 (comment)

That being said, consensus is that a GetOrAdd method for HashSet is not particularly valuable compared to just using HashSet.Add except for the rare scenario where you need to access the actual instance stored in the HashSet:

#82875 (comment)

kellypleahy · 2024-04-17T17:26:58Z

Let me try to rephrase what I believe is problematic with this group of proposals:

When modelling a HashSet<T> where T has a certain definition of equality it is good practice to regard t1 : T and t2 : T as interchangeable (modulo mutability concerns) whenever t1.Equals(t2). This invariant is observed in all definitions of equality that the framework ships regardless of whether it uses reference or structural equality. Because T is the key of HashSet<T> you can't actually look it up unless you already have a T. But if you already have the T then

There is no need to bother with delayed initialization factories such as the ones declared here and

There is no need to retrieve the underlying stored T if it exists because the two should be interchangeable. This implies that bool Contains(T item) and bool Add(T item) are the only APIs needed to interrogate the hash set.

In my opinion, types whose equality doesn't imply interchangeability gives a strong indication that the values should be indexed by a different key type using a dictionary.

I think this is a very narrow view of the HashSet value. I've personally used HashSet many times where you want to "deduplicate" a bunch of reference objects that are "identical" in semantic meaning but occupying different memory. This is precisely the use case where GetOrAdd would be valuable and would greatly improve the performance. That said, I'm not sure I understand why TryGetValue doesn't do what is needed.

eiriktsarpalis · 2024-04-17T19:24:15Z

Proving that all these people participated in a mistake is not a trivial task! 🙂

I don't intend to, this is a discussion about adding a new API and not removing and old API 🙂

The notion "two Ts that are equal by one definition are equal by all definitions" never existed for the HashSet<T>.

I never claimed that types have a unique notion of equality, what I did say is that under a given definition of equality two equal values should be considered equivalent. IEqualityComparer<T> is a tool that lets you override built-in equality, but it shouldn't invalidate that principle (in practice this can be abused).

I've personally used HashSet many times where you want to "deduplicate" a bunch of reference objects that are "identical" in semantic meaning but occupying different memory.

Same here, but the mere act of deduplication implies that you're happy to discard references up to that notion of equality, which suggests that these references are equivalent in your domain.

This is precisely the use case where GetOrAdd would be valuable and would greatly improve the performance.

How would a GetOrAdd method improve performance in a deduplication method?

theodorzoulias · 2024-04-17T21:17:00Z

Proving that all these people participated in a mistake is not a trivial task! 🙂

I don't intend to, this is a discussion about adding a new API and not removing and old API 🙂

@eiriktsarpalis since you are arguing against the proposed GetOrAdd API from a theoretical/philosophical/academic standpoint, and since this API is building on the precedence of the existing TryGetValue API, I think that it would be intellectually consistent to argue against both APIs. Either both are mistakes, or none. If you think that the TryGetValue was approved incorrectly, it's still possible to mitigate this mistake by obsoleting the API, hiding it from the tools, updating the documentation with warnings against using it etc.

I think that the GetOrAdd could be implemented with the help of the CollectionsMarshal like this:

public static T GetOrAdd<T>(this HashSet<T> source, T item)
{
    ref T itemRef = ref CollectionsMarshal.GetItemRefOrAdd(source, item, out bool exists);
    // In case the item is not found, it has been added to the set and assigned to the itemRef.
    return itemRef;
}

Although the CollectionsMarshal.GetItemRefOrAdd (as proposed in #82840) is unsafe, the above GetOrAdd implementation is not. The caller of the GetOrAdd cannot misuse it in a way that would corrupt the HashSet<T>. The same cannot be said for the factory-based GetOrAdd:

public static T GetOrAdd<T, TArg>(this HashSet<T> source, T item, Func<TArg, T> itemFactory, TArg factoryArgument)
{
    ref T itemRef = ref CollectionsMarshal.GetItemRefOrAdd(source, item, out bool exists);
    if (!exists)
    {
        item = itemFactory(factoryArgument);
        Debug.Assert(source.Comparer.Equals(item, itemRef));
        itemRef = item;
    }
    return itemRef;
}

This one transfers the responsibility of correctness to the caller. It's up to the itemFactory to return a T equal to the original T. Checking for equality in release mode would defeat the whole purpose of this API, which is to have comparable performance with a single call to TryGetValue or a single call to Add.

kellypleahy · 2024-04-17T21:28:53Z

Although the CollectionsMarshal.GetItemRefOrAdd (as proposed in #82840) is unsafe, the above GetOrAdd implementation is not. The caller of the GetOrAdd cannot misuse it in a way that would corrupt the HashSet<T>. The same cannot be said for the factory-based GetOrAdd:
public static T GetOrAdd<T, TArg>(this HashSet<T> source, T item, Func<TArg, T> itemFactory, TArg factoryArgument)
{
    ref T itemRef = ref CollectionsMarshal.GetItemRefOrAdd(source, item, out bool exists);
    if (!exists)
    {
        item = itemFactory(factoryArgument);
        Debug.Assert(source.Comparer.Equals(item, itemRef));
        Debug.Assert(source.Comparer.GetHashCode(item) == source.Comparer.GetHashCode(itemRef));
        itemRef = item;
    }
    return itemRef;
}
This one transfers the responsibility of correctness to the caller. It's up to the itemFactory to return a T equal to the original T. Checking for equality in release mode would defeat the whole purpose of this API, which is to avoid hashing the T element more than once.

I cannot for the life of me understand why the Factory version of this GetOrAdd for HashSet would be useful. If you've already constructed the T, why do you need a factory to help you create ANOTHER one? I totally understand the value of GetOrAdd(item) though, as it serves the purpose I mentioned earlier.

@eiriktsarpalis - the performance improvement for a GetOrAdd in the dedup situation is that you don't "search for it" then "add it if you didn't find it" in two steps since these are two hashes and searches. Similar to the argument for the same thing for Dictionary.

theodorzoulias · 2024-04-17T21:33:11Z

@kellypleahy the purpose of the factory-based GetOrAdd is to avoid an allocation in case the T is reference type. For context see this previous comment.

TylerBrinkley · 2024-04-17T21:38:56Z

I think the difference here with TryGetValue vs GetOrAdd is that TryGetValue unlocked functionality that was not possible before but this is only for performance and convenience reasons which doesn't seem as compelling given the static footprint costs.

kellypleahy · 2024-04-17T21:43:14Z

@kellypleahy the purpose of the factory-based GetOrAdd is to avoid an allocation in case the T is reference type. For context see this previous comment.

Hmm... that one seems really bad to me, but I guess that's just my opinion. I'm 100% +1 on GetOrAdd(item). Much more dubious on GetOrAdd<TArg>(item, Func<TArg, T> factory, TArg factoryArg). I can only imagine that being confusing at best. If you really need to "initialize" the object that you passed to item, I think it should be reasonably easy for you to know that the item you passed was the one added (maybe object.ReferenceEquals() or something). I do see the use case you are talking about, but I feel like it would be mostly just confusing to callers and handles a VERY rare case and may not even provide a performance boost at all over the double lookups/insert (I'm not sure I have the brainpower to confirm or deny a benefit at the moment).

theodorzoulias · 2024-04-17T21:48:41Z

@TylerBrinkley in case you have any idea about what is the static footprint cost of adding a method in the HashSet<T> collection, please share it because I would really love to know. Otherwise, how are you able to make a judgment? On the one hand we have performance and convenience, and on the other hand we have a mystery cost. That's hardly enough data for judging anything.

theodorzoulias · 2024-04-17T21:57:26Z

@kellypleahy in case the item already exists in the set, the GetOrAdd will return the existing item, not the item passed as argument. There is a different proposal about a method AddOrUpdate that always adds the item argument in the set, replacing and discarding any existing element.

eiriktsarpalis · 2024-04-17T22:08:55Z

the performance improvement for a GetOrAdd in the dedup situation is that you don't "search for it" then "add it if you didn't find it" in two steps since these are two hashes and searches. Similar to the argument for the same thing for Dictionary.

I don't understand. There's no need to make multiple lookups when deduplicating with today's APIs:

IEnumerable<T> Dedup<T>(IEnumerable<T> source)
{
    HashSet<T> set = new();
    foreach (T item in source)
    {
        if (set.Add(item)) yield return item;
    }
}

kellypleahy · 2024-04-17T22:19:32Z

the performance improvement for a GetOrAdd in the dedup situation is that you don't "search for it" then "add it if you didn't find it" in two steps since these are two hashes and searches. Similar to the argument for the same thing for Dictionary.

I don't understand. There's no need to make multiple lookups when deduplicating with today's APIs:
IEnumerable<T> Dedup<T>(IEnumerable<T> source)
{
    HashSet<T> set = new();
    foreach (T item in source)
    {
        if (set.Add(item)) yield return item;
    }
}

Sorry, I meant dedup when, for instance, deserializing. Not dedup for the purposes of "unique" in an array or result set. In other words, if you have an immutable datastructure that you are deserializing that might have many "duplicates" of leaf nodes in it (this is very common for my applications) or making sure that you don't have 100s of copies of the same string stored in a graph, then you need two lookups (because you need the item from the set).

CyrusNajmabadi · 2024-04-18T04:12:37Z

I think that it would be intellectually consistent to argue against both APIs. Either both are mistakes, or none. If you think that the TryGetValue was approved incorrectly, it's still possible to mitigate this mistake by obsoleting the API, hiding it from the tools, updating the documentation with warnings against using it etc.

The discussion around an existing shipped API, and what should be done about it, is entirely separate from the discussion around adding new functionality.

If you want to discuss that, I would recommend a separate discussion. At best, It definitely will not help with this discussion. At worst it will definitely derail it by adding confusion and muddying the topic.

theodorzoulias · 2024-04-18T05:01:27Z

@CyrusNajmabadi I have no reason to start a discussion about the TryGetValue API, because I think that it's a useful API, and without it the HashSet<T> would be incomplete. Eirik has a theoretical argument against the GetOrAdd proposal, which also applies to the existing TryGetValue API. I am mainly pointing out that the mere existence of this API weakens his argument. Eirik can't just say, or imply, that the approval of the TryGetValue was a mistake, and then carry on with his argument. I mean, he can, but then his argument will remain weak.

CyrusNajmabadi · 2024-04-18T05:06:25Z

I am mainly pointing out that the mere existence of this API weakens his argument.

It doesn't. The two are separate discussions. As erik rightly pointed out

"I don't think the presence of an API that somehow got shipped in the past refutes the point I'm trying to make."

The new API has to stand on its own merits. Erik has made reasonable claims both as to the goals and designs of this API and the very real and relevant perf concerns the runtime has here. The latter of which is extremely important given hte size of hte ecosystem here and how impacted they are by that. This means there is a very high bar to clear, despite how desirable you might want this.

Furthermore, i'll go back to my previous point. If the perf concerns of the runtime are not acceptable to you as part of the decision makign process, then the same applies in reverse. If we take perf out, then just use a dictionary, or just use an extension that makes multiple calls. If these are unpalatable to you because of the perf costs they will have, then please recognize the same concerns apply to the runtime for a the much larger and much more perf sensitive ecosystem they are targetting.

I have no reason to start a discussion about the TryGetValue API, because I think that it's a useful API

Ok. Then that should be tabled. It's presence isn't part of the discussion here or the argument as to whether or not GetOrAdd be added.

theodorzoulias · 2024-04-18T05:20:41Z

@CyrusNajmabadi the HashSet<T> is already shaped by the existence of the TryGetValue. Discussing whether the HashSet<T> should acknowledge differing definitions of equality is a moot discussion. It already does. And thankfully so.

If the perf concerns of the runtime are not acceptable to you as part of the decision making process, then the same applies in reverse.

I never said that it's not acceptable. I said that it's depressing. I wish that the static footprint issue had been addressed with some clever technical solution, instead of addressing it by restricting the evolution of the APIs.

CyrusNajmabadi · 2024-04-18T05:43:54Z

It already does.

As already stated: "it shouldn't be carte blanche for us to follow up with more mistakes."

Many API authors have intimate experience with this. Your past isn't perfect, and there are things you wish you hadn't done. But you learn from that and use that to inform future decisions. That means you pay more attention and are more careful. The experience with mistakes motivates you to go away from that space, not more toward it.

CyrusNajmabadi · 2024-04-18T05:48:43Z

I never said that it's not acceptable. I said that it's depressing.

I'm not sure what can be done about that. However, it's part and parcel of every api decision made practically everywhere that their are constraints you have to work within. That's just life :) Maybe the constraints change in the future and you can revisit some decisions.

theodorzoulias · 2024-04-18T06:03:55Z

As already stated: "it shouldn't be carte blanche for us to follow up with more mistakes."

And why exactly was the TryGetValue a mistake? It is certainly benefiting the developers who are searching for it, finding it, and using it. Do you know of anyone who, on the basis of Eirik's theoretical argument, has been harmed by it? This question is relevant to our discussion about the GetOrAdd, because if the TryGetValue hasn't harmed anyone, neither the GetOrAdd will. These two APIs are interconnected. Currently the only way to implement the GetOrAdd (inefficiently though), is to search for the element with the TryGetValue, and if it's not found then Add it to the set.

CyrusNajmabadi · 2024-04-18T06:06:25Z

Do you know of anyone who, on the basis of Eirik's theoretical argument, has been harmed by it?

I'd consider myself harmed, due to the confusion such an API causes, and the bizarre sort of dual view where there can be different equality concepts.

--

Yup, giving this more thought, i really think that API was def a mistake. Even the use case is much better served by using a Dictionary. It's clear, reasonable, and already about what Dictionarys are intended for (mapping from one domain to another). That the key and value domains overlap in terms of their types, that they have different equality concepts (IEquatable vs Identity) means they should be separated out and not collapsed into a single entity to me.

CyrusNajmabadi · 2024-04-18T06:09:45Z

Note: i'm not opposed to a back-door for perf. But i would keep well outside of the mainline api as this just feels very broken.

theodorzoulias · 2024-04-18T06:29:29Z

I'd consider myself harmed, due to the confusion such an API causes, and the bizarre sort of dual view where there can be different equality concepts.

--

Yup, giving this more thought, i really think that API was def a mistake. Even the use case is much better served by using a Dictionary. It's clear, reasonable, and already about what Dictionarys are intended for (mapping from one domain to another). That the key and value domains overlap in terms of their types, that they have different equality concepts (IEquatable vs Identity) means they should be separated out and not collapsed into a single entity to me.

@CyrusNajmabadi well, you can certainly substitute hashsets with dictionaries in most everyday scenarios. If you need a HashSet<T> for lookups, and not for its specialized APIs (IntersectWith, IsSubsetOf etc), then just use a Dictionary<T, object>, ignore the TValue and call it a day. Bigger memory footprint, but one less collection to have in mind. Not only you will be less harmed by the TryGetValue, but also by the HashSet<T> collection as a whole.

(I am half-joking here, because I think that you are half-joking as well.)

CyrusNajmabadi · 2024-04-18T06:34:44Z

I'm not joking at all. I think sets should be used only when everything is treated uniformly in the domain of hte equality contract picked for the collection. If you need differing equality contracts, use a dictionary, it properly models that these are distinct spaces.

As i mentioned already, i would not be opposed to a back-door api here if absolute perf is needed. But i would not make it part of the core surface contract of HashSet.

and not for its specialized APIs (IntersectWith, IsSubsetOf etc)

If i have such a set with such strange equality concepts, i would be very wary about using any of those apis. I would def want to spell out precisely what semantics i was expecting, and would likely prefer a dictionary here to keep things clear.

theodorzoulias · 2024-04-18T07:06:06Z

I'm not joking at all. I think sets should be used only when everything is treated uniformly in the domain of hte equality contract picked for the collection. If you need differing equality contracts, use a dictionary, it properly models that these are distinct spaces.

As i mentioned already, i would not be opposed to a back-door api here if absolute perf is needed. But i would not make it part of the core surface contract of HashSet.

@CyrusNajmabadi the concept of embedding a key inside the items stored in a collection is as old as the KeyedCollection<TKey,TItem>. Which was released with the .NET Framework 2.0, so it's 18+ years old. Some people, myself included, like the concept but not the implementation, so we are re-implementing it using the most lightweight collection suitable for the task, which is the HashSet<T>. To do this we need the TryGetValue, and since your are not joking about being harmed by the existence of this API, I feel obliged to apologize on behalf of all the developers who have used it. Also I appreciate that you support the quest for CollectionsMarshal support for the HashSet<T> collection. There are many obstacles to overcome before this support is materialized, with the first being that the relevant API proposal #82840 is currently locked.

CyrusNajmabadi · 2024-04-18T07:13:51Z

There are many obstacles to overcome before this support is materialized

That's sometimes how things just fall out. Trust me, there are probably around 500 apis i'd love that would make my life materially better. But i'm not going to be able to adequately justify those apis happening in the core runtime apis themselves. That's life.

Which was released with the .NET Framework 2.0, so it's 18+ years old.

Yes. For context, i've worked on .Net since before it was 1.0. I'm very familiar with its history (and its massive warts). I don't look at things like that as any sort of good argument for continuing with such decisions in the future. As mentioned before, i want us to look to the past to see what we can learn from it, esp. about what not to do.

so we are re-implementing it using the most lightweight collection suitable for the task, which is the HashSet

Sure. But, like with all api requests, just because you have a need/desire here, is only part of the equation. Every API request ever is driven by people trying to solve problems, where they've decided to use some API and have found it lacking. The goal of the runtime is to get in that feedback, and weigh all the factors involved, to make the best choices for the needs of the ecosystem as a whole. And, a lot of the time, that means that people wanting those api changes will not get what they want. In that case you have to decide how impactful that is to you. If it's something you can live with, great. If not, that sometimes means you have to build your own solution.

As someone who has built easily hundreds of APIs (some of which did end up rolling into the BCL, but most of which did not), i'm very familiar with this :)

eiriktsarpalis · 2024-04-18T09:17:30Z

Sorry, I meant dedup when, for instance, deserializing. Not dedup for the purposes of "unique" in an array or result set.

@kellypleahy fully deserializing an object only to later discard it on the basis of duplication sounds inefficient to me. The cost of double lookup should be multiple orders of magnitude smaller compared to deserialization so a GetOrAdd method wouldn't move the needle much.

Do you know of anyone who, on the basis of Eirik's theoretical argument, has been harmed by it?

You keep repeating that my argument is theoretical, but bugs related to inappropriate use of equality are very real and very common. In concrete terms, this pattern encourages definitions of equality that only make sense when scoped to one particular hash table. This creates problems because, generally speaking, we encourage that types have universal equality semantics (e.g. via the CA2231 and CA1067 analyzers). Assuming that equality uses projection semantics (e.g. like the Item class cited by the OP) then the outcome of the == operator is bound to be misused in other contexts. Things become even trickier if the type additional needs to implement comparison semantics, because it will need to be kept compatible with equality.

If there's a broader takeaway to be made here, equality is hard and you should avoid implementing it manually unless there is a very compelling reason to do it. The language(s) and framework provide tools that let you (mostly) delegate that responsibility: I personally use tuples and records whenever I have a need to represent structural equality for complex data:

items.DistinctBy(item => (item.Date, item.Amount));

public sealed class Item // reference equality
{
    public DateTime Date { get; }
    public int Amount { get; }
    public List<int> RelatedInfo { get; }
}

the concept of embedding a key inside the items stored in a collection is as old as the KeyedCollection<TKey,TItem>. Which was released with the .NET Framework 2.0, so it's 18+ years old.

The BCL is an aggregate of over 20 years of shifting programming paradigms, patterns and, yes, in some cases mistakes. Some APIs have stood the test of time, others not so much. It does not follow that we should be doubling down on a particular pattern just because an instance of it got shipped at some point in time. Neither does not aggressively removing/deprecating APIs deemed obsolete by one member of the BCL team betray some sort of hypocricy or double standard as you're suggesting: it's simply not worth doing from a cost-benefit analysis for most cases.

theodorzoulias · 2024-04-18T10:37:53Z

@eiriktsarpalis how many bugs do you believe have been caused by the introduction of the TryGetValue API, that resulted in defective lookup functionality in production code, since the August of 2017 when the API was first released with .NET Core 2.0? Furthermore, how many bugs do you believe will be prevented in the future by blocking the GetOrAdd API proposal here? I can only guess the answer for the first question, but I am pretty sure about the second. Exactly zero bugs will be prevented, because no one who is thinking to switch from a Dictionary<K,V> to a HashSet<T> for whatever reason, will see the lack of the GetOrAdd as a show-stopper. The functionality of the GetOrAdd can be trivially implemented with the combination TryGetValue+Add, which is good enough for every imaginable scenario. It's just not perfectly efficient. So all this discussion here about equality definitions etc seems to me that is in vain.

I should mention that not only the HashSet<T> collection, but also the ImmutableHashSet<T>, ImmutableSortedSet<T>, SortedSet<T> and FrozenSet<T> are equipped with TryGetValue methods with T equalValue, out T actualValue parameters. I didn't know that, I just learned it by reading the #20073 where the TryGetValue was approved (on Feb 7, 2017), and then doing a search in the docs. Which makes the whole fight against multiple equality definitions even more futile IMHO. Isn't it obvious that the ship has sailed on this?

eiriktsarpalis · 2024-04-18T11:25:10Z

no one who is thinking to switch from a Dictionary<K,V> to a HashSet for whatever reason, will see the lack of the GetOrAdd as a show-stopper.

I don't think anybody in this thread expressed interest in preventing users from using HashSet to their heart's content. The concern that actually has been expressed is that we don't introduce new methods to popular generic types willy-nilly, particularly ones that are of questionable utility and can be worked around trivially as you are pointing out.

theodorzoulias · 2024-04-18T12:08:07Z

I don't think anybody in this thread expressed interest in preventing users from using HashSet to their heart's content.

Eirik I interpreted this previous comment:

In my opinion, types whose equality doesn't imply interchangeability gives a strong indication that the values should be indexed by a different key type using a dictionary.

...as a desire to discourage developers from making creative use of the HashSet<T> collection, when a semantically straightforward Dictionary<K,V>-based solution exists. Βeing opposed to the TryGetValue API, and hypothetically going back in time and preventing it from being approved, would certainly be a strong discouragement. Without it the opportunities for using creatively the HashSet<T> would be greatly diminished.

msftgits transferred this issue from dotnet/corefx Jan 31, 2020

msftgits added this to the 5.0 milestone Jan 31, 2020

layomia modified the milestones: 5.0.0, Future Jun 24, 2020

eiriktsarpalis assigned layomia Oct 14, 2020

eiriktsarpalis closed this as completed Oct 29, 2021

eiriktsarpalis mentioned this issue Nov 1, 2021

Add ImmutableInterlocked.GetOrAdd overloads for immutable sets #27979

Open

ghost locked as resolved and limited conversation to collaborators Nov 28, 2021

theodorzoulias mentioned this issue Mar 1, 2023

Add CollectionsMarshal support for the HashSet<T> collection #82840

Closed

bairog mentioned this issue Mar 2, 2023

Howto implement HashSet.GetOrAdd(T) extension method in .NET 6 or .NET 7 using CollectionsMarshal #82875

Closed

eiriktsarpalis mentioned this issue Apr 15, 2024

[API Proposal]: New method HashSet<T>.AddOrUpdate that replaces an existing element if found #101003

Closed

dotnet unlocked this conversation Apr 16, 2024

eiriktsarpalis closed this as not planned Won't fix, can't repro, duplicate, stale Apr 18, 2024

github-actions bot locked and limited conversation to collaborators May 19, 2024

HashSet.GetOrAdd(T) #21603

HashSet.GetOrAdd(T) #21603

Comments

jnm2 commented May 9, 2017 • edited Loading

Proposed API

jnm2 commented May 9, 2017 • edited Loading

svick commented May 10, 2017 • edited Loading

jnm2 commented May 10, 2017

danmoseley commented May 10, 2017

jnm2 commented May 10, 2017

TylerBrinkley commented May 10, 2017 • edited Loading

kellypleahy commented Jun 8, 2017 • edited Loading

danmoseley commented Aug 19, 2018

jnm2 commented Aug 19, 2018

danmoseley commented Aug 19, 2018

jnm2 commented Aug 19, 2018

kellypleahy commented Aug 19, 2018

danmoseley commented Aug 20, 2018

jkotas commented Aug 20, 2018

danmoseley commented Aug 20, 2018

eiriktsarpalis commented Nov 10, 2020

eiriktsarpalis commented Mar 2, 2023

kellypleahy commented Apr 17, 2024

eiriktsarpalis commented Apr 17, 2024

theodorzoulias commented Apr 17, 2024 • edited Loading

kellypleahy commented Apr 17, 2024

theodorzoulias commented Apr 17, 2024

TylerBrinkley commented Apr 17, 2024

kellypleahy commented Apr 17, 2024

theodorzoulias commented Apr 17, 2024

theodorzoulias commented Apr 17, 2024

eiriktsarpalis commented Apr 17, 2024

kellypleahy commented Apr 17, 2024

CyrusNajmabadi commented Apr 18, 2024

theodorzoulias commented Apr 18, 2024

CyrusNajmabadi commented Apr 18, 2024 • edited Loading

theodorzoulias commented Apr 18, 2024

CyrusNajmabadi commented Apr 18, 2024

CyrusNajmabadi commented Apr 18, 2024

theodorzoulias commented Apr 18, 2024

CyrusNajmabadi commented Apr 18, 2024 • edited Loading

CyrusNajmabadi commented Apr 18, 2024

theodorzoulias commented Apr 18, 2024

CyrusNajmabadi commented Apr 18, 2024 • edited Loading

theodorzoulias commented Apr 18, 2024

CyrusNajmabadi commented Apr 18, 2024

eiriktsarpalis commented Apr 18, 2024

theodorzoulias commented Apr 18, 2024

eiriktsarpalis commented Apr 18, 2024

theodorzoulias commented Apr 18, 2024

jnm2 commented May 9, 2017 •

edited

Loading

jnm2 commented May 9, 2017 •

edited

Loading

svick commented May 10, 2017 •

edited

Loading

TylerBrinkley commented May 10, 2017 •

edited

Loading

kellypleahy commented Jun 8, 2017 •

edited

Loading

theodorzoulias commented Apr 17, 2024 •

edited

Loading

CyrusNajmabadi commented Apr 18, 2024 •

edited

Loading

CyrusNajmabadi commented Apr 18, 2024 •

edited

Loading

CyrusNajmabadi commented Apr 18, 2024 •

edited

Loading