Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce in!(x, s::Set) to improve performance of unique() #45156

Merged
merged 2 commits into from
May 3, 2022

Conversation

petvana
Copy link
Member

@petvana petvana commented May 2, 2022

This PR introduces in! function not to compute ht_keyindex twice. It is not exported yet, but the fallback would be:

in!(x, s::AbstractSet) = in(x, s) ? true : (push!(s, x); false)

Master

julia> using Random, BenchmarkTools

julia> @btime unique($([randstring(20) for i in 1:10]));
  483.954 ns (7 allocations: 848 bytes)

julia> @btime allunique($([randstring(20) for i in 1:10]));
  270.550 ns (4 allocations: 384 bytes)

PR

julia> @btime unique($([randstring(20) for i in 1:10]));
  354.769 ns (7 allocations: 848 bytes)

julia> @btime allunique($([randstring(20) for i in 1:10]));
  270.236 ns (4 allocations: 384 bytes)

(allunique has been already optimized, but with Dict.)

@petvana petvana added collections Data structures holding multiple items, e.g. sets performance Must go faster labels May 2, 2022
@tkf
Copy link
Member

tkf commented May 3, 2022

Regarding in!, I think it'd be nice if we can keep it internal and explore we more generic and extensible abstract interface. FWIW, I personally prefer addressing #33758 and #45080 for solving this. #33758 (implemented in: PreludeDicts.modify!) lets us define in!-like functions for arbitrary set types derived from dict types. #45080 lets us implement an interface like PreludeDicts.tryinsert! that is more generic; e.g., it can be used for (string) interning.

Copy link
Member

@tkf tkf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The patch itself LGTM

@petvana
Copy link
Member Author

petvana commented May 3, 2022

Initially, I was inspired by get! function for Dict, and I haven't found anything like this for Set to avoid hashing twice. There are two reasons why I selected in!. First, it is straightforward, short, and easy to understand. Secondly, !in! looks so imperative in the code. 😎

julia/base/set.jl

Lines 149 to 151 in 709daeb

for x in itr
!in!(x, seen) && push!(out, x)
end

@jakobnissen
Copy link
Contributor

This also ties into the long-standing issue of providing token-based API for dictionaries (#24454), which would also expose an interface to solve this problem.
FWIW, this is the default behaviour of Rust's HashSet. Not that we should copy them, but it lends credence to it being a reasonable function to have.

@petvana
Copy link
Member Author

petvana commented May 3, 2022

Token-based API seems best in the long term. Here, I have two minor comments. It would be nice to store age from Dict in the token and check if the user is trying to shoot themself into a foot by using an outdated token. Secondly, the same token can be used for iterating over the collection to prevent the user to insert elements while iterating because it may be hard-to-debug (and is not well documented).

Meanwhile, the PR seems ready to be merged.

@KristofferC KristofferC merged commit 9a2f5ae into JuliaLang:master May 3, 2022
@tkf
Copy link
Member

tkf commented May 3, 2022

Functional API (modify!) is "better" than token-based API in the sense that the latter can be derived from the former while the former enables concurrent dictionary implementation. This is discussed in Comparison to token-based API.

vtjnash pushed a commit that referenced this pull request Nov 25, 2023
I think `in!` is a useful general function for users, and would be good
to have as official API. Its semantics is clear and unambiguous, while
providing a clear performance advantage over the naive implementation.

For more evidence that this functionality is useful, consider:
* Rust's `HashSet::insert` works just like this implementation of `in!`
* This function was already used in the implementation of `Base.unique`,
precisely for the performance over the naive approach

Comes from #45156 with some initial discussion.
mkitti pushed a commit to mkitti/julia that referenced this pull request Dec 9, 2023
I think `in!` is a useful general function for users, and would be good
to have as official API. Its semantics is clear and unambiguous, while
providing a clear performance advantage over the naive implementation.

For more evidence that this functionality is useful, consider:
* Rust's `HashSet::insert` works just like this implementation of `in!`
* This function was already used in the implementation of `Base.unique`,
precisely for the performance over the naive approach

Comes from JuliaLang#45156 with some initial discussion.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
collections Data structures holding multiple items, e.g. sets performance Must go faster
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants