Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HashSet.GetOrAdd(T) #21603

Closed
jnm2 opened this issue May 9, 2017 · 70 comments
Closed

HashSet.GetOrAdd(T) #21603

jnm2 opened this issue May 9, 2017 · 70 comments
Assignees
Labels
api-needs-work API needs work before it is approved, it is NOT ready for implementation area-System.Collections
Milestone

Comments

@jnm2
Copy link
Contributor

jnm2 commented May 9, 2017

There are many times when there is a custom equality comparison being done but I need access to the original instance.

What I'd like:

var mergedItems = new HashSet<Item>();

foreach (var item in items)
{
    mergedItems.GetOrAdd(item).RelatedInfo.Add(x);
}

public sealed class Item : IEquatable<Item>
{
    public DateTime Date { get; }
    public int Amount { get; }
    public List<int> RelatedInfo { get; }

    public bool Equals(Item other) => other != null && other.Date == Date && other.Amount == Amount;

    public override bool Equals(object obj) => obj is Item item && Equals(item);

    public override int GetHashCode() => // ...
}

Today I'm forced to use a Dictionary which duplicates storage of the key for no reason.
A dictionary is also wasteful because it has to do two hash lookups, one for TryGetValue and one for Add:

var mergedItems = new Dictionary<Item, Item>();

foreach (var item in items)
{
    if (!mergedItems.TryGetValue(item, out var otherItem))
        mergedItems.Add(item, otherItem = item);
    otherItem.RelatedInfo.Add(x);
}

This is no better; it's a lot more to maintain, unnecessarily coupled to the definition of equality, it's still duplicate storage of the key, and it's still duplicating the number of hash lookups:

var mergedItems = new Dictionary<(DateTime, int), Item>();

foreach (var item in items)
{
    if (!mergedItems.TryGetValue((item.Date, item.Amount), out var otherItem))
        mergedItems.Add((item.Date, item.Amount), otherItem = item);
    otherItem.RelatedInfo.Add(x);
}

KeyedCollection is the worst of all because in addition to the dictionary, it keeps a list I'll never use.

Proposed API

namespace System.Collections.Generic;

public class HashSet<T>
{
+   /// <summary>
+   /// Searches the set for a given element and returns the equal element it finds, if any;
+   /// otherwise, adds the element to the set.
+   /// </summary>
+   /// <param name="item">The element to search for, or to add to the set.</param>
+   /// <returns>
+   /// The equal element that the search found, or the given element that was added to the set.
+   /// </returns>
+   public T GetOrAdd(T item);
}
@jnm2
Copy link
Contributor Author

jnm2 commented May 9, 2017

Similar to https://github.com/dotnet/corefx/issues/11291 but not exactly a duplicate, since I don't want a generic parameter for the key or any other distinctions between keys and values.

@svick
Copy link
Contributor

svick commented May 10, 2017

Today I'm forced to use a Dictionary which duplicates storage of the key for no reason.

You're not, if you're on .Net Core 2.0, you can use HashSet<T>.TryGetValue() (https://github.com/dotnet/corefx/issues/15691), though it still requires two lookups.

On one hand, the addition of TryGetValue indicates that retrieving a value from a HashSet based on a equal, but distinct value makes sense. On the other hand, it means adding GetOrAdd is less important: it improves performance and convenience, but doesn't add new capabilities.

@jnm2
Copy link
Contributor Author

jnm2 commented May 10, 2017

Sweet! That was the thread I couldn't find! So implementing GetOrAdd should be a piece of cake. @TylerBrinkley? :D

@danmoseley
Copy link
Member

@jnm2 there is a code size cost to adding new API to commonly instantiated generic types which means there is a higher bar for adding them. Consensus we achieved is here https://github.com/dotnet/corefx/issues/17275#issuecomment-291636863
Given this one could be achieved by an extension method, that might be the most palatable approach.,

@jnm2
Copy link
Contributor Author

jnm2 commented May 10, 2017

The extension method still has a downside, which is the duplicate hashtable lookup. In that sense, this is a fundamental operation. Too bad it can't be the other way around, other instance methods like Contains as extension methods implemented in terms of the fundamentals.

@TylerBrinkley
Copy link
Contributor

TylerBrinkley commented May 10, 2017

I think this one falls under the first criteria @terrajobst set for #17275 which is

  1. CONSIDER making rare methods on generic types extension methods to avoid forcing the code gen to instantiate all these methods

While there is a performance hit of 2 lookups I don't think this method would be used all that much.

@kellypleahy
Copy link
Contributor

kellypleahy commented Jun 8, 2017

If we're going to do this, can we also have a GetOrAdd that takes the factory function? This is something that seems also quite valuable? Also, can we put these on Dictionary<TKey,TValue> (as extension methods) if they don't already exist too? Is there already a ticket for Dictionary<TKey,TValue>?

@danmoseley
Copy link
Member

@jnm2 can you clarify how this is not satisfied by #20073 ? I'm missing it.

@jnm2
Copy link
Contributor Author

jnm2 commented Aug 19, 2018

@danmosemsft #20073 doesn't provide any way to do GetOrAdd without a duplicate hash lookup, that I can see. It enables us to do this, but in the 'add' scenario, there is duplicate work being done in the calls to TryGetValue and Add:

public static T GetOrAdd<T>(this HashSet<T> hashSet, T equalValue)
{
    if (hashSet.TryGetValue(equalValue, out var actualValue))
    {
        return actualValue;
    }

    hashSet.Add(equalValue);
    return equalValue;
}

@TylerBrinkley makes the point above that the performance hit of these duplicate lookups may not be important enough to add the instance method (requiring a bit more JIT time per generic instantiation).

There is indeed precedent for preferring extension methods at the cost of duplicate hash lookups:

https://github.com/dotnet/corefx/blob/b9ce86fbec92349c203a4c33805cffb8ca24c10f/src/System.Collections/src/System/Collections/Generic/CollectionExtensions.cs#L25-L39

However, see the opposite point being made by @jamesqo in the Dictionary.GetOrAdd issue: https://github.com/dotnet/corefx/issues/2877#issuecomment-319244808

@danmoseley
Copy link
Member

Ah yes i forgot we made an extension method. Given that decision it sounds like this issue can be closed?
The cost of growing commonly used generic types is the code size. Dictionary is one of the most sensitive, a small change can grow hello world size significantly.

@jnm2
Copy link
Contributor Author

jnm2 commented Aug 19, 2018

@danmosemsft #15059 is about Dictionary.GetOrAdd; this one is about HashSet.GetOrAdd. As far as I'm aware there is no extension method for GetOrAdd for either one.

@kellypleahy
Copy link
Contributor

@danmosemsft I'm a bit confused here. Does this really have such a big impact on code size? I think I'd be more worried personally about the performance impact of the second lookup than a tiny bit more code for each generic instantiation, but I guess that's just me (and I guess it really depends on whether people will actually use this instance method on hot code paths). Am I understanding properly that the impact would be an additional JITed function and an additional vtable entry (one each per generic instantiation, not per instance of objects), and the cost to perform the JIT on first use of said function?

@danmoseley
Copy link
Member

@jnm2 my bad, there's of course two asks here, one was "a custom equality comparison being done but I need access to the original instance." which seems now satisfied by HashSet<T>.TryGetValue (as already observed above) and the other is to be able to "GetOrAdd", which as you say is not currently satisfied, although on Dictionary<K,V> there is TryAdd it doesn't return the value.

As for the cost of adding to the type itself (in this case, necessary to avoid dual lookups) it is not the JIT time but the code duplication. Dictionary<K,V> is instantiated many times in a typical app - once for reference types, then once for (I guess - not my area of expertise) each sized value type, and that can add up significantly. So several proposed additions to the type were either abandoned or moved to extension methods (which is unfortunate if like in this case it would be less efficient). I would expect this to be a lesser concern for HashSet since it's presumably instantiated for fewer types.

@jkotas can you remind me of the kind of size increase you were quoting for a reasonably sized method addition to Dictionary?

@jkotas
Copy link
Member

jkotas commented Aug 20, 2018

can you remind me of the kind of size increase you were quoting for a reasonably sized method addition to Dictionary?

I have been looking at the code size vs. benefit. I do not think I have said a universal number on how much it is ok to add.

@danmoseley
Copy link
Member

I meant, if I add a new 10-20 line method to Dictionary, as a rule of thumb how much does that increase the size of a typical app. I believe you gave a number but I cannot find it.

@msftgits msftgits transferred this issue from dotnet/corefx Jan 31, 2020
@msftgits msftgits added this to the 5.0 milestone Jan 31, 2020
@layomia layomia modified the milestones: 5.0.0, Future Jun 24, 2020
@eiriktsarpalis
Copy link
Member

I feel like we should close this. Given we're going down the CollectionsMarshal route for #15059, we should probably consider a similar approach for HashSet. cc @GrabYourPitchforks

@eiriktsarpalis
Copy link
Member

For the record, using CollectionsMarshal for HashSet was proven to be not viable because you can't do ref returns on keys:

#82840 (comment)

That being said, consensus is that a GetOrAdd method for HashSet is not particularly valuable compared to just using HashSet.Add except for the rare scenario where you need to access the actual instance stored in the HashSet:

#82875 (comment)

@kellypleahy
Copy link
Contributor

Let me try to rephrase what I believe is problematic with this group of proposals:

When modelling a HashSet<T> where T has a certain definition of equality it is good practice to regard t1 : T and t2 : T as interchangeable (modulo mutability concerns) whenever t1.Equals(t2). This invariant is observed in all definitions of equality that the framework ships regardless of whether it uses reference or structural equality. Because T is the key of HashSet<T> you can't actually look it up unless you already have a T. But if you already have the T then

  1. There is no need to bother with delayed initialization factories such as the ones declared here and
  2. There is no need to retrieve the underlying stored T if it exists because the two should be interchangeable. This implies that bool Contains(T item) and bool Add(T item) are the only APIs needed to interrogate the hash set.

In my opinion, types whose equality doesn't imply interchangeability gives a strong indication that the values should be indexed by a different key type using a dictionary.

I think this is a very narrow view of the HashSet value. I've personally used HashSet many times where you want to "deduplicate" a bunch of reference objects that are "identical" in semantic meaning but occupying different memory. This is precisely the use case where GetOrAdd would be valuable and would greatly improve the performance. That said, I'm not sure I understand why TryGetValue doesn't do what is needed.

@eiriktsarpalis
Copy link
Member

Proving that all these people participated in a mistake is not a trivial task! 🙂

I don't intend to, this is a discussion about adding a new API and not removing and old API 🙂

The notion "two Ts that are equal by one definition are equal by all definitions" never existed for the HashSet<T>.

I never claimed that types have a unique notion of equality, what I did say is that under a given definition of equality two equal values should be considered equivalent. IEqualityComparer<T> is a tool that lets you override built-in equality, but it shouldn't invalidate that principle (in practice this can be abused).

I've personally used HashSet many times where you want to "deduplicate" a bunch of reference objects that are "identical" in semantic meaning but occupying different memory.

Same here, but the mere act of deduplication implies that you're happy to discard references up to that notion of equality, which suggests that these references are equivalent in your domain.

This is precisely the use case where GetOrAdd would be valuable and would greatly improve the performance.

How would a GetOrAdd method improve performance in a deduplication method?

@theodorzoulias
Copy link
Contributor

theodorzoulias commented Apr 17, 2024

Proving that all these people participated in a mistake is not a trivial task! 🙂

I don't intend to, this is a discussion about adding a new API and not removing and old API 🙂

@eiriktsarpalis since you are arguing against the proposed GetOrAdd API from a theoretical/philosophical/academic standpoint, and since this API is building on the precedence of the existing TryGetValue API, I think that it would be intellectually consistent to argue against both APIs. Either both are mistakes, or none. If you think that the TryGetValue was approved incorrectly, it's still possible to mitigate this mistake by obsoleting the API, hiding it from the tools, updating the documentation with warnings against using it etc.

I think that the GetOrAdd could be implemented with the help of the CollectionsMarshal like this:

public static T GetOrAdd<T>(this HashSet<T> source, T item)
{
    ref T itemRef = ref CollectionsMarshal.GetItemRefOrAdd(source, item, out bool exists);
    // In case the item is not found, it has been added to the set and assigned to the itemRef.
    return itemRef;
}

Although the CollectionsMarshal.GetItemRefOrAdd (as proposed in #82840) is unsafe, the above GetOrAdd implementation is not. The caller of the GetOrAdd cannot misuse it in a way that would corrupt the HashSet<T>. The same cannot be said for the factory-based GetOrAdd:

public static T GetOrAdd<T, TArg>(this HashSet<T> source, T item, Func<TArg, T> itemFactory, TArg factoryArgument)
{
    ref T itemRef = ref CollectionsMarshal.GetItemRefOrAdd(source, item, out bool exists);
    if (!exists)
    {
        item = itemFactory(factoryArgument);
        Debug.Assert(source.Comparer.Equals(item, itemRef));
        itemRef = item;
    }
    return itemRef;
}

This one transfers the responsibility of correctness to the caller. It's up to the itemFactory to return a T equal to the original T. Checking for equality in release mode would defeat the whole purpose of this API, which is to have comparable performance with a single call to TryGetValue or a single call to Add.

@kellypleahy
Copy link
Contributor

Although the CollectionsMarshal.GetItemRefOrAdd (as proposed in #82840) is unsafe, the above GetOrAdd implementation is not. The caller of the GetOrAdd cannot misuse it in a way that would corrupt the HashSet<T>. The same cannot be said for the factory-based GetOrAdd:

public static T GetOrAdd<T, TArg>(this HashSet<T> source, T item, Func<TArg, T> itemFactory, TArg factoryArgument)
{
    ref T itemRef = ref CollectionsMarshal.GetItemRefOrAdd(source, item, out bool exists);
    if (!exists)
    {
        item = itemFactory(factoryArgument);
        Debug.Assert(source.Comparer.Equals(item, itemRef));
        Debug.Assert(source.Comparer.GetHashCode(item) == source.Comparer.GetHashCode(itemRef));
        itemRef = item;
    }
    return itemRef;
}

This one transfers the responsibility of correctness to the caller. It's up to the itemFactory to return a T equal to the original T. Checking for equality in release mode would defeat the whole purpose of this API, which is to avoid hashing the T element more than once.

I cannot for the life of me understand why the Factory version of this GetOrAdd for HashSet would be useful. If you've already constructed the T, why do you need a factory to help you create ANOTHER one? I totally understand the value of GetOrAdd(item) though, as it serves the purpose I mentioned earlier.

@eiriktsarpalis - the performance improvement for a GetOrAdd in the dedup situation is that you don't "search for it" then "add it if you didn't find it" in two steps since these are two hashes and searches. Similar to the argument for the same thing for Dictionary.

@theodorzoulias
Copy link
Contributor

@kellypleahy the purpose of the factory-based GetOrAdd is to avoid an allocation in case the T is reference type. For context see this previous comment.

@TylerBrinkley
Copy link
Contributor

I think the difference here with TryGetValue vs GetOrAdd is that TryGetValue unlocked functionality that was not possible before but this is only for performance and convenience reasons which doesn't seem as compelling given the static footprint costs.

@kellypleahy
Copy link
Contributor

@kellypleahy the purpose of the factory-based GetOrAdd is to avoid an allocation in case the T is reference type. For context see this previous comment.

Hmm... that one seems really bad to me, but I guess that's just my opinion. I'm 100% +1 on GetOrAdd(item). Much more dubious on GetOrAdd<TArg>(item, Func<TArg, T> factory, TArg factoryArg). I can only imagine that being confusing at best. If you really need to "initialize" the object that you passed to item, I think it should be reasonably easy for you to know that the item you passed was the one added (maybe object.ReferenceEquals() or something). I do see the use case you are talking about, but I feel like it would be mostly just confusing to callers and handles a VERY rare case and may not even provide a performance boost at all over the double lookups/insert (I'm not sure I have the brainpower to confirm or deny a benefit at the moment).

@theodorzoulias
Copy link
Contributor

@TylerBrinkley in case you have any idea about what is the static footprint cost of adding a method in the HashSet<T> collection, please share it because I would really love to know. Otherwise, how are you able to make a judgment? On the one hand we have performance and convenience, and on the other hand we have a mystery cost. That's hardly enough data for judging anything.

@theodorzoulias
Copy link
Contributor

@kellypleahy in case the item already exists in the set, the GetOrAdd will return the existing item, not the item passed as argument. There is a different proposal about a method AddOrUpdate that always adds the item argument in the set, replacing and discarding any existing element.

@eiriktsarpalis
Copy link
Member

the performance improvement for a GetOrAdd in the dedup situation is that you don't "search for it" then "add it if you didn't find it" in two steps since these are two hashes and searches. Similar to the argument for the same thing for Dictionary.

I don't understand. There's no need to make multiple lookups when deduplicating with today's APIs:

IEnumerable<T> Dedup<T>(IEnumerable<T> source)
{
    HashSet<T> set = new();
    foreach (T item in source)
    {
        if (set.Add(item)) yield return item;
    }
}

@kellypleahy
Copy link
Contributor

the performance improvement for a GetOrAdd in the dedup situation is that you don't "search for it" then "add it if you didn't find it" in two steps since these are two hashes and searches. Similar to the argument for the same thing for Dictionary.

I don't understand. There's no need to make multiple lookups when deduplicating with today's APIs:

IEnumerable<T> Dedup<T>(IEnumerable<T> source)
{
    HashSet<T> set = new();
    foreach (T item in source)
    {
        if (set.Add(item)) yield return item;
    }
}

Sorry, I meant dedup when, for instance, deserializing. Not dedup for the purposes of "unique" in an array or result set. In other words, if you have an immutable datastructure that you are deserializing that might have many "duplicates" of leaf nodes in it (this is very common for my applications) or making sure that you don't have 100s of copies of the same string stored in a graph, then you need two lookups (because you need the item from the set).

@CyrusNajmabadi
Copy link
Member

I think that it would be intellectually consistent to argue against both APIs. Either both are mistakes, or none. If you think that the TryGetValue was approved incorrectly, it's still possible to mitigate this mistake by obsoleting the API, hiding it from the tools, updating the documentation with warnings against using it etc.

The discussion around an existing shipped API, and what should be done about it, is entirely separate from the discussion around adding new functionality.

If you want to discuss that, I would recommend a separate discussion. At best, It definitely will not help with this discussion. At worst it will definitely derail it by adding confusion and muddying the topic.

@theodorzoulias
Copy link
Contributor

@CyrusNajmabadi I have no reason to start a discussion about the TryGetValue API, because I think that it's a useful API, and without it the HashSet<T> would be incomplete. Eirik has a theoretical argument against the GetOrAdd proposal, which also applies to the existing TryGetValue API. I am mainly pointing out that the mere existence of this API weakens his argument. Eirik can't just say, or imply, that the approval of the TryGetValue was a mistake, and then carry on with his argument. I mean, he can, but then his argument will remain weak.

@CyrusNajmabadi
Copy link
Member

CyrusNajmabadi commented Apr 18, 2024

I am mainly pointing out that the mere existence of this API weakens his argument.

It doesn't. The two are separate discussions. As erik rightly pointed out

"I don't think the presence of an API that somehow got shipped in the past refutes the point I'm trying to make."

The new API has to stand on its own merits. Erik has made reasonable claims both as to the goals and designs of this API and the very real and relevant perf concerns the runtime has here. The latter of which is extremely important given hte size of hte ecosystem here and how impacted they are by that. This means there is a very high bar to clear, despite how desirable you might want this.

Furthermore, i'll go back to my previous point. If the perf concerns of the runtime are not acceptable to you as part of the decision makign process, then the same applies in reverse. If we take perf out, then just use a dictionary, or just use an extension that makes multiple calls. If these are unpalatable to you because of the perf costs they will have, then please recognize the same concerns apply to the runtime for a the much larger and much more perf sensitive ecosystem they are targetting.

I have no reason to start a discussion about the TryGetValue API, because I think that it's a useful API

Ok. Then that should be tabled. It's presence isn't part of the discussion here or the argument as to whether or not GetOrAdd be added.

@theodorzoulias
Copy link
Contributor

@CyrusNajmabadi the HashSet<T> is already shaped by the existence of the TryGetValue. Discussing whether the HashSet<T> should acknowledge differing definitions of equality is a moot discussion. It already does. And thankfully so.

If the perf concerns of the runtime are not acceptable to you as part of the decision making process, then the same applies in reverse.

I never said that it's not acceptable. I said that it's depressing. I wish that the static footprint issue had been addressed with some clever technical solution, instead of addressing it by restricting the evolution of the APIs.

@CyrusNajmabadi
Copy link
Member

It already does.

As already stated: "it shouldn't be carte blanche for us to follow up with more mistakes."

Many API authors have intimate experience with this. Your past isn't perfect, and there are things you wish you hadn't done. But you learn from that and use that to inform future decisions. That means you pay more attention and are more careful. The experience with mistakes motivates you to go away from that space, not more toward it.

@CyrusNajmabadi
Copy link
Member

I never said that it's not acceptable. I said that it's depressing.

I'm not sure what can be done about that. However, it's part and parcel of every api decision made practically everywhere that their are constraints you have to work within. That's just life :) Maybe the constraints change in the future and you can revisit some decisions.

@theodorzoulias
Copy link
Contributor

As already stated: "it shouldn't be carte blanche for us to follow up with more mistakes."

And why exactly was the TryGetValue a mistake? It is certainly benefiting the developers who are searching for it, finding it, and using it. Do you know of anyone who, on the basis of Eirik's theoretical argument, has been harmed by it? This question is relevant to our discussion about the GetOrAdd, because if the TryGetValue hasn't harmed anyone, neither the GetOrAdd will. These two APIs are interconnected. Currently the only way to implement the GetOrAdd (inefficiently though), is to search for the element with the TryGetValue, and if it's not found then Add it to the set.

@CyrusNajmabadi
Copy link
Member

CyrusNajmabadi commented Apr 18, 2024

Do you know of anyone who, on the basis of Eirik's theoretical argument, has been harmed by it?

I'd consider myself harmed, due to the confusion such an API causes, and the bizarre sort of dual view where there can be different equality concepts.

--

Yup, giving this more thought, i really think that API was def a mistake. Even the use case is much better served by using a Dictionary. It's clear, reasonable, and already about what Dictionarys are intended for (mapping from one domain to another). That the key and value domains overlap in terms of their types, that they have different equality concepts (IEquatable vs Identity) means they should be separated out and not collapsed into a single entity to me.

@CyrusNajmabadi
Copy link
Member

Note: i'm not opposed to a back-door for perf. But i would keep well outside of the mainline api as this just feels very broken.

@theodorzoulias
Copy link
Contributor

I'd consider myself harmed, due to the confusion such an API causes, and the bizarre sort of dual view where there can be different equality concepts.

--

Yup, giving this more thought, i really think that API was def a mistake. Even the use case is much better served by using a Dictionary. It's clear, reasonable, and already about what Dictionarys are intended for (mapping from one domain to another). That the key and value domains overlap in terms of their types, that they have different equality concepts (IEquatable vs Identity) means they should be separated out and not collapsed into a single entity to me.

@CyrusNajmabadi well, you can certainly substitute hashsets with dictionaries in most everyday scenarios. If you need a HashSet<T> for lookups, and not for its specialized APIs (IntersectWith, IsSubsetOf etc), then just use a Dictionary<T, object>, ignore the TValue and call it a day. Bigger memory footprint, but one less collection to have in mind. Not only you will be less harmed by the TryGetValue, but also by the HashSet<T> collection as a whole.

(I am half-joking here, because I think that you are half-joking as well.)

@CyrusNajmabadi
Copy link
Member

CyrusNajmabadi commented Apr 18, 2024

I'm not joking at all. I think sets should be used only when everything is treated uniformly in the domain of hte equality contract picked for the collection. If you need differing equality contracts, use a dictionary, it properly models that these are distinct spaces.

As i mentioned already, i would not be opposed to a back-door api here if absolute perf is needed. But i would not make it part of the core surface contract of HashSet.

and not for its specialized APIs (IntersectWith, IsSubsetOf etc)

If i have such a set with such strange equality concepts, i would be very wary about using any of those apis. I would def want to spell out precisely what semantics i was expecting, and would likely prefer a dictionary here to keep things clear.

@theodorzoulias
Copy link
Contributor

I'm not joking at all. I think sets should be used only when everything is treated uniformly in the domain of hte equality contract picked for the collection. If you need differing equality contracts, use a dictionary, it properly models that these are distinct spaces.

As i mentioned already, i would not be opposed to a back-door api here if absolute perf is needed. But i would not make it part of the core surface contract of HashSet.

@CyrusNajmabadi the concept of embedding a key inside the items stored in a collection is as old as the KeyedCollection<TKey,TItem>. Which was released with the .NET Framework 2.0, so it's 18+ years old. Some people, myself included, like the concept but not the implementation, so we are re-implementing it using the most lightweight collection suitable for the task, which is the HashSet<T>. To do this we need the TryGetValue, and since your are not joking about being harmed by the existence of this API, I feel obliged to apologize on behalf of all the developers who have used it. Also I appreciate that you support the quest for CollectionsMarshal support for the HashSet<T> collection. There are many obstacles to overcome before this support is materialized, with the first being that the relevant API proposal #82840 is currently locked.

@CyrusNajmabadi
Copy link
Member

There are many obstacles to overcome before this support is materialized

That's sometimes how things just fall out. Trust me, there are probably around 500 apis i'd love that would make my life materially better. But i'm not going to be able to adequately justify those apis happening in the core runtime apis themselves. That's life.

Which was released with the .NET Framework 2.0, so it's 18+ years old.

Yes. For context, i've worked on .Net since before it was 1.0. I'm very familiar with its history (and its massive warts). I don't look at things like that as any sort of good argument for continuing with such decisions in the future. As mentioned before, i want us to look to the past to see what we can learn from it, esp. about what not to do.

so we are re-implementing it using the most lightweight collection suitable for the task, which is the HashSet

Sure. But, like with all api requests, just because you have a need/desire here, is only part of the equation. Every API request ever is driven by people trying to solve problems, where they've decided to use some API and have found it lacking. The goal of the runtime is to get in that feedback, and weigh all the factors involved, to make the best choices for the needs of the ecosystem as a whole. And, a lot of the time, that means that people wanting those api changes will not get what they want. In that case you have to decide how impactful that is to you. If it's something you can live with, great. If not, that sometimes means you have to build your own solution.

As someone who has built easily hundreds of APIs (some of which did end up rolling into the BCL, but most of which did not), i'm very familiar with this :)

@eiriktsarpalis
Copy link
Member

Sorry, I meant dedup when, for instance, deserializing. Not dedup for the purposes of "unique" in an array or result set.

@kellypleahy fully deserializing an object only to later discard it on the basis of duplication sounds inefficient to me. The cost of double lookup should be multiple orders of magnitude smaller compared to deserialization so a GetOrAdd method wouldn't move the needle much.

Do you know of anyone who, on the basis of Eirik's theoretical argument, has been harmed by it?

You keep repeating that my argument is theoretical, but bugs related to inappropriate use of equality are very real and very common. In concrete terms, this pattern encourages definitions of equality that only make sense when scoped to one particular hash table. This creates problems because, generally speaking, we encourage that types have universal equality semantics (e.g. via the CA2231 and CA1067 analyzers). Assuming that equality uses projection semantics (e.g. like the Item class cited by the OP) then the outcome of the == operator is bound to be misused in other contexts. Things become even trickier if the type additional needs to implement comparison semantics, because it will need to be kept compatible with equality.

If there's a broader takeaway to be made here, equality is hard and you should avoid implementing it manually unless there is a very compelling reason to do it. The language(s) and framework provide tools that let you (mostly) delegate that responsibility: I personally use tuples and records whenever I have a need to represent structural equality for complex data:

items.DistinctBy(item => (item.Date, item.Amount));

public sealed class Item // reference equality
{
    public DateTime Date { get; }
    public int Amount { get; }
    public List<int> RelatedInfo { get; }
}

the concept of embedding a key inside the items stored in a collection is as old as the KeyedCollection<TKey,TItem>. Which was released with the .NET Framework 2.0, so it's 18+ years old.

The BCL is an aggregate of over 20 years of shifting programming paradigms, patterns and, yes, in some cases mistakes. Some APIs have stood the test of time, others not so much. It does not follow that we should be doubling down on a particular pattern just because an instance of it got shipped at some point in time. Neither does not aggressively removing/deprecating APIs deemed obsolete by one member of the BCL team betray some sort of hypocricy or double standard as you're suggesting: it's simply not worth doing from a cost-benefit analysis for most cases.

@theodorzoulias
Copy link
Contributor

@eiriktsarpalis how many bugs do you believe have been caused by the introduction of the TryGetValue API, that resulted in defective lookup functionality in production code, since the August of 2017 when the API was first released with .NET Core 2.0? Furthermore, how many bugs do you believe will be prevented in the future by blocking the GetOrAdd API proposal here? I can only guess the answer for the first question, but I am pretty sure about the second. Exactly zero bugs will be prevented, because no one who is thinking to switch from a Dictionary<K,V> to a HashSet<T> for whatever reason, will see the lack of the GetOrAdd as a show-stopper. The functionality of the GetOrAdd can be trivially implemented with the combination TryGetValue+Add, which is good enough for every imaginable scenario. It's just not perfectly efficient. So all this discussion here about equality definitions etc seems to me that is in vain.

I should mention that not only the HashSet<T> collection, but also the ImmutableHashSet<T>, ImmutableSortedSet<T>, SortedSet<T> and FrozenSet<T> are equipped with TryGetValue methods with T equalValue, out T actualValue parameters. I didn't know that, I just learned it by reading the #20073 where the TryGetValue was approved (on Feb 7, 2017), and then doing a search in the docs. Which makes the whole fight against multiple equality definitions even more futile IMHO. Isn't it obvious that the ship has sailed on this?

@eiriktsarpalis
Copy link
Member

no one who is thinking to switch from a Dictionary<K,V> to a HashSet for whatever reason, will see the lack of the GetOrAdd as a show-stopper.

I don't think anybody in this thread expressed interest in preventing users from using HashSet to their heart's content. The concern that actually has been expressed is that we don't introduce new methods to popular generic types willy-nilly, particularly ones that are of questionable utility and can be worked around trivially as you are pointing out.

@eiriktsarpalis eiriktsarpalis closed this as not planned Won't fix, can't repro, duplicate, stale Apr 18, 2024
@theodorzoulias
Copy link
Contributor

I don't think anybody in this thread expressed interest in preventing users from using HashSet to their heart's content.

Eirik I interpreted this previous comment:

In my opinion, types whose equality doesn't imply interchangeability gives a strong indication that the values should be indexed by a different key type using a dictionary.

...as a desire to discourage developers from making creative use of the HashSet<T> collection, when a semantically straightforward Dictionary<K,V>-based solution exists. Βeing opposed to the TryGetValue API, and hypothetically going back in time and preventing it from being approved, would certainly be a strong discouragement. Without it the opportunities for using creatively the HashSet<T> would be greatly diminished.

@github-actions github-actions bot locked and limited conversation to collaborators May 19, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
api-needs-work API needs work before it is approved, it is NOT ready for implementation area-System.Collections
Projects
None yet
Development

No branches or pull requests