Introduce in!(x, s::Set) to improve performance of unique() #45156

petvana · 2022-05-02T20:26:36Z

This PR introduces an in! function to avoid computing ht_keyindex twice. It is not exported yet, but the fallback would be:

in!(x, s::AbstractSet) = in(x, s) ? true : (push!(s, x); false)
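To make the contract of that fallback concrete, here is a minimal sketch. The name `in_fallback!` is a hypothetical stand-in chosen so the snippet runs without relying on the unexported `Base.in!`; the body is the one-liner quoted above.

```julia
# Hypothetical stand-in for the fallback quoted above (not the Base function).
# Returns true if x was already in s; otherwise inserts x and returns false.
in_fallback!(x, s::AbstractSet) = in(x, s) ? true : (push!(s, x); false)

s = Set{Int}()
first_call  = in_fallback!(1, s)   # 1 was absent, so it gets inserted
second_call = in_fallback!(1, s)   # 1 is present now, nothing changes
```

The fallback still hashes `x` twice when the element is missing (once in `in`, once in `push!`); the specialized `Set` method in this PR is what is meant to avoid that.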

Master

julia> using Random, BenchmarkTools

julia> @btime unique($([randstring(20) for i in 1:10]));
  483.954 ns (7 allocations: 848 bytes)

julia> @btime allunique($([randstring(20) for i in 1:10]));
  270.550 ns (4 allocations: 384 bytes)

PR

julia> @btime unique($([randstring(20) for i in 1:10]));
  354.769 ns (7 allocations: 848 bytes)

julia> @btime allunique($([randstring(20) for i in 1:10]));
  270.236 ns (4 allocations: 384 bytes)

(allunique has already been optimized, but with Dict.)

tkf · 2022-05-03T06:02:19Z

Regarding in!, I think it'd be nice if we can keep it internal and explore a more generic and extensible abstract interface. FWIW, I personally prefer addressing #33758 and #45080 for solving this. #33758 (implemented in: PreludeDicts.modify!) lets us define in!-like functions for arbitrary set types derived from dict types. #45080 lets us implement an interface like PreludeDicts.tryinsert! that is more generic; e.g., it can be used for (string) interning.

tkf

The patch itself LGTM

petvana · 2022-05-03T07:47:35Z

Initially, I was inspired by the get! function for Dict, and I haven't found anything like this for Set to avoid hashing twice. There are two reasons why I selected in!. First, it is straightforward, short, and easy to understand. Second, !in! looks so imperative in the code. 😎

julia/base/set.jl

Lines 149 to 151 in 709daeb

    for x in itr
        !in!(x, seen) && push!(out, x)
    end
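For context, the loop above can be placed in a small self-contained function. This is only a sketch of the pattern, not the code in base/set.jl: `unique_sketch` and `seen_before!` are hypothetical names, and the helper uses the two-lookup fallback rather than the optimized `Set` method.

```julia
# Hypothetical helper with the same contract as in!: report whether x was
# already present, inserting it if it was not.
seen_before!(x, seen::AbstractSet) = x in seen ? true : (push!(seen, x); false)

# Order-preserving deduplication built around the loop quoted above.
function unique_sketch(itr)
    T = eltype(itr)
    out = Vector{T}()
    seen = Set{T}()
    for x in itr
        !seen_before!(x, seen) && push!(out, x)
    end
    return out
end

result = unique_sketch([3, 1, 3, 2, 1])
```

Each element is pushed to `out` only the first time it is encountered, so the first-occurrence order is kept.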

jakobnissen · 2022-05-03T10:54:37Z

This also ties into the long-standing issue of providing a token-based API for dictionaries (#24454), which would also expose an interface to solve this problem.
FWIW, this is the default behaviour of Rust's HashSet. Not that we should copy them, but it lends credence to it being a reasonable function to have.

petvana · 2022-05-03T12:53:04Z

A token-based API seems best in the long term. Here, I have two minor comments. First, it would be nice to store the age from Dict in the token and check whether the user is about to shoot themselves in the foot by using an outdated token. Second, the same token could be used when iterating over the collection to prevent the user from inserting elements while iterating, because doing so may cause hard-to-debug problems (and is not well documented).
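The age check suggested here could look roughly like the toy sketch below. Everything in it is hypothetical (`AgedSet`, `Token`, `lookup`, `insert_at!`); none of these names exist in Base, and the wrapper does not avoid the second hash lookup. It only illustrates how a stored age lets a stale token be rejected.

```julia
# Toy wrapper that counts mutations; purely illustrative.
mutable struct AgedSet{T}
    items::Set{T}
    age::Int              # incremented on every successful insertion
end
AgedSet{T}() where {T} = AgedSet{T}(Set{T}(), 0)

# A token remembers the lookup result and the age at lookup time.
struct Token{T}
    key::T
    found::Bool
    age::Int
end

lookup(s::AgedSet{T}, x) where {T} = Token{T}(x, x in s.items, s.age)

# Insert using a token; refuse tokens created before a later mutation.
function insert_at!(s::AgedSet, t::Token)
    t.age == s.age || error("stale token: the set was modified after lookup")
    t.found && return false
    push!(s.items, t.key)
    s.age += 1
    return true
end

s = AgedSet{Int}()
t1 = lookup(s, 1)
t2 = lookup(s, 2)
inserted = insert_at!(s, t1)      # fresh token, element gets inserted
stale_rejected = try
    insert_at!(s, t2)             # t2 predates the insertion above
    false
catch
    true
end
```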

Meanwhile, the PR seems ready to be merged.

tkf · 2022-05-03T14:51:42Z

A functional API (modify!) is "better" than a token-based API in the sense that the latter can be derived from the former, while the former enables concurrent dictionary implementations. This is discussed in Comparison to token-based API.

I think `in!` is a useful general function for users, and would be good to have as official API. Its semantics is clear and unambiguous, while providing a clear performance advantage over the naive implementation. For more evidence that this functionality is useful, consider:

* Rust's `HashSet::insert` works just like this implementation of `in!`
* This function was already used in the implementation of `Base.unique`, precisely for the performance over the naive approach

Comes from #45156 with some initial discussion.

Introduce in!(x, s::Set) to improve performance of unique()

d3ff474

petvana added the collections and performance labels May 2, 2022

tkf approved these changes May 3, 2022

KristofferC approved these changes May 3, 2022

Comment on what in! does

709daeb

KristofferC merged commit 9a2f5ae into JuliaLang:master May 3, 2022

petvana mentioned this pull request Apr 23, 2023

Implement pop! for AbstractSet and remove some implementations of pop! for concrete sets #49463

Draft

petvana mentioned this pull request Oct 8, 2023

Document and export Base.in! #51636

Merged
