Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Decision on the behavior of map #4

Open
bkamins opened this issue Dec 5, 2021 · 17 comments
Open

Decision on the behavior of map #4

bkamins opened this issue Dec 5, 2021 · 17 comments

Comments

@bkamins
Copy link

bkamins commented Dec 5, 2021

As discussed in JuliaData/PooledArrays.jl#76 and JuliaData/DataFrames.jl#2957 here is the current behavior of map on SparseVector:

julia> x = spzeros(5)
5-element SparseVector{Float64, Int64} with 0 stored entries

julia> map(_ -> rand(), x)
5-element SparseVector{Float64, Int64} with 5 stored entries:
  [1]  =  0.979672
  [2]  =  0.979672
  [3]  =  0.979672
  [4]  =  0.979672
  [5]  =  0.979672

The question is if this is expected or a bug (my opinion is that it is a bug).

CC @nalimilan @ParadaCarleton

@nalimilan
Copy link
Contributor

So basically this behavior assumes that the function passed to map is pure. I realize this is necessary to be able to return a sparse array. But it's problematic given that this isn't documented anywhere. It would be good to document in the docstring for map whether it's expected that f is pure in general and/or for sparse arrays in particular.

In PooledArrays we recently made the opposite choice of assuming that f is not pure by default, but with a pure=false keyword argument to map which enabled optimizations. The difference in behaviors with SparseArrays isn't ideal.

@ParadaCarleton
Copy link

I will note that I think this is the correct behavior, as map is a clearly functional construct and so should only be used with pure functions.

@bkamins
Copy link
Author

bkamins commented Dec 6, 2021

I see the point of @ParadaCarleton, but then we should clearly document it as @nalimilan noted. Currently the documentation states a different contract, and e.g. DataFrames.jl relies on the current contract.
We can change the implementation in DataFrames.jl if needed, but first I would like to clarify here the position of the core devs what assumptions about the passed function map makes.

@JeffBezanson - sorry to bother you, but map is a fundamental part of the core language, so maybe have a quick opinion?

@oscardssmith
Copy link

oscardssmith commented Dec 6, 2021

I think current behavior is fine, Jeff says ugg

@tkf
Copy link
Contributor

tkf commented Dec 6, 2021

There seem to be relevant discussions in:

FWIW, I'd love to see a spec that assumes purity of fin map(f, xs) and f.(xs) but I think there are a lot of code bases doing something like

julia> rand.() .+ (1:5)
5-element Vector{Float64}:
 1.6158199752768656
 2.781538017460546
 3.929416467969739
 4.6286338994867435
 5.9563765610939985

IIRC, @mbauman is in favor this behavior.

(Maybe we should just have Parallel.map that assumes effective purity and race freedom of f and then rigorously support side-effects in Base.map. This way, we can say "We made Base.map slow in some cases for bug fixes, but you can use a new parallel map if you need speed." It's probably a bit off-topic though.)

@bkamins
Copy link
Author

bkamins commented Dec 7, 2021

Following goerch's comment on Slack:

julia> x = spzeros(0)
0-element SparseVector{Float64, Int64} with 0 stored entries

julia> map(println, x)
0.0
0-element SparseVector{Nothing, Int64} with 0 stored entries

it is clear that we should define that map on sparse vector assumes the function is pure currently.

As a minimum it should be documented. But my preference is what @tkf sugestted - make map iterate all elements of a collection (as is specified in its docstring) and either:

  • do what we do in PooledArrays.jl - i.e. add a kwarg that would allow to signal that the passed function is pure and switch to faster mode for sparse arrays;
  • or introduce a new function like suggested Parallel.map that would assume purity of the passed function (and potentially it would do many other tricks to make it fast using multiple threads that @tkf excels at 😄).

@ParadaCarleton
Copy link

ParadaCarleton commented Dec 7, 2021

As for the current documentation, it's ambiguous, but I actually lean towards the reading of it as assuming purity. "Transform collection c by applying f to each element" intuitively makes me think of f as pure, because no mention is made of the order that f is applied in. I'd prefer changing the documentation to emphasize that f must be pure.

With regards to whether it's preferable to use a pure keyword or Parallel.map, I'd rather have the pure keyword than a new function. (And I'd prefer defaulting to purity rather than impurity, given the former is a much more common use case.) I feel as though a separate Parallel.map() is going to cause a lot of slow code from people new to Julia who don't know about the separate map functions outside of base, and it'll also make it harder to overload -- anything overloading Base.map might be forced to overload Parallel.map() as well.

Making map into a weirder way to do a for loop doesn't quite sit right with me, though. Allowing the use of impure functions with map is going to cause a lot of weirdness when ordering becomes less clear, e.g. for sets or dictionaries, where results could depend on the order f is applied. Many packages allow map to be applied to unordered collections, and there's been similar proposals to allow mapping sets in base. Similar problems apply for anything that's done lazily.

In the very long run (2.0), the best solution might be to implement traits for functions, including a way to assert purity, and then throw an error when map is called with a non-pure f.

@ToucheSir
Copy link

One good (if unfortunate) reason to use map is that it generally preserves the input collection type. This can be immensely useful when working with (named)tuples, e.g:

function cumsum(t::Tuple)
  subTotal = 0
  map(t) do elem
    subTotal += elem
  end
end

If more functions gained the ability to maintain the collection type, or there were some Rust-eque into functionality for efficient accumulation/collection into things other than heap-allocated arrays, the scope of map could be tightened.

@tkf
Copy link
Contributor

tkf commented Dec 7, 2021

Another possible API is to introduce an "executor" argument like map(f, collections..., executor) like Folds.map does. Come to think of it, I'm more in favor of this than Parallel.map :)

@ParadaCarleton
Copy link

ParadaCarleton commented Dec 18, 2021

A relevant comment -- modifying the behavior of map to allow for impure functions would make it much harder to change the behavior of map and filter in the future if we wanted to address comments like this one. I believe @StefanKarpinski mentioned that he was interested in someone introducing features like these in 2.0.

In addition, removing the assumption of purity would contradict the behavior of Iterators.map, which is lazy and therefore only works with pure functions.

@bkamins
Copy link
Author

bkamins commented Jan 7, 2022

I have just realized that the same applies to broadcasting:

julia> x = spzeros(10)
10-element SparseVector{Float64, Int64} with 0 stored entries

julia> (i -> rand()).(x)
10-element SparseVector{Float64, Int64} with 10 stored entries:
  [1 ]  =  0.935886
  [2 ]  =  0.935886
  [3 ]  =  0.935886
  [4 ]  =  0.935886
  [5 ]  =  0.935886
  [6 ]  =  0.935886
  [7 ]  =  0.935886
  [8 ]  =  0.935886
  [9 ]  =  0.935886
  [10]  =  0.935886

@mbauman - is this intended? The Julia Manual states:

Julia provides broadcast, which expands singleton dimensions in array arguments to match the corresponding dimension in the other array without using extra memory, and applies the given function elementwise

@ParadaCarleton
Copy link

I'm honestly fine with pretty much any proposed behavior, but kicking the can down the road feels like the worst of both worlds. The longer we wait the more code gets written that could break, and it also makes it harder for me to write optimized code.

@bkamins
Copy link
Author

bkamins commented Feb 11, 2022

Looking at the state of this issue (in particular not being tagged as "bug" by core devs) I understand that even if something changes here it will not happen until Julia 2.0 (as changing a non-bug behavior would be breaking).
But I am not a core dev so my judgement here is not something to rely on.

@ParadaCarleton
Copy link

ParadaCarleton commented Feb 11, 2022

Oh, I agree with that, but it should at least be clearly documented whether this is a bug or not. As you mentioned, the devs not tagging it as a bug suggests to me that they think it's correct behavior (and I'd tend to agree). In that case, the documentation should clearly specify that map and broadcasting operations currently assume/require purity, and broadcasting an impure function is considered undefined behavior.

We could also add a new function that explicitly doesn't require purity (I suggest apply) that people can use without having to worry about sparse data structures or future optimizations in 2.0 causing bugs.

@ParadaCarleton
Copy link

It looks like FillArrays has been operating under the assumption that the map documentation implicitly allows you to assume purity:

I believe the consensus is that the behaviour of broadcasting is "defined" by the BroadcastStyle: there is not necessarily a guarantee that it is equivalent to [f(a[k]) for k in eachindex(a)].

I'll gladly make a PR clearly documenting this, if one of the core devs can clarify that this is indeed intentional behavior. I'm less sure if I could implement something like apply, since I don't know how map works internally or where all its methods are defined, but I think it would be simple for someone who's more familiar with the internals of Base, and I could definitely try.

@nalimilan
Copy link
Contributor

It would make sense to make a doc PR to Base to get feedback from core devs. Clearly this issue is broader than SparseArrays or any other package.

@Keno
Copy link
Contributor

Keno commented Jun 19, 2022

So basically this behavior assumes that the function passed to map is pure. I realize this is necessary to be able to return a sparse array

Note that in current julia, it is possible to ask the compiler this question using Base.infer_effects.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants