Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: add lazy finite difference DiffView and fdiv #19

Closed
wants to merge 4 commits into from

Conversation

johnnychen94
Copy link
Member

@johnnychen94 johnnychen94 commented Oct 11, 2021

The purpose is to move Images.div here: JuliaImages/Images.jl#971 (comment)

This PR implements a lazy version of fdiff so that we can compose multiple pipeline together into fdiv without the need to worry about memory allocation.

A small benchmark on its performance shows that we still have some room for optimization:

using ImageFiltering
using ImageBase

laplacian(X) = fdiv(ntuple(i->DiffView(X, dims=i), ndims(X))...)
ref_laplacian(X) = imfilter(X, Kernel.Laplacian(ntuple(x->true, ndims(X))), "circular")

X = rand(Float32, 256, 256);
fdiff(X, dims=1) == DiffView(X, dims=1) # true
@btime fdiff($X, dims=1); # 18.822 μs (2 allocations: 256.05 KiB)
@btime collect(DiffView($X, dims=1)); # 151.683 μs (2 allocations: 256.05 KiB)

laplacian(X) == ref_laplacian(X) # true
@btime laplacian($X); # 612.230 μs (2 allocations: 256.05 KiB)
@btime ref_laplacian($X); # 323.276 μs (16 allocations: 521.02 KiB)

X = rand(Float32, 1024, 1024);
fdiff(X, dims=1) == DiffView(X, dims=1) # true
@btime fdiff($X, dims=1); # 326.182 μs (2 allocations: 4.00 MiB)
@btime collect(DiffView($X, dims=1)); # 2.559 ms (2 allocations: 4.00 MiB)

laplacian(X) == ref_laplacian(X) # true
@btime laplacian($X); # 9.336 ms (2 allocations: 4.00 MiB)
@btime ref_laplacian($X); # 5.720 ms (16 allocations: 8.03 MiB)

closes #20

@johnnychen94 johnnychen94 changed the title WIP: add ajoint_fdiff and div WIP: add ajoint_fdiff and fdiv Oct 11, 2021
@codecov
Copy link

codecov bot commented Oct 11, 2021

Codecov Report

Merging #19 (39058e6) into master (bd3db5b) will increase coverage by 1.74%.
The diff coverage is 97.22%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #19      +/-   ##
==========================================
+ Coverage   90.59%   92.33%   +1.74%     
==========================================
  Files           6        6              
  Lines         202      274      +72     
==========================================
+ Hits          183      253      +70     
- Misses         19       21       +2     
Impacted Files Coverage Δ
src/diff.jl 97.08% <97.22%> (+0.31%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bd3db5b...39058e6. Read the comment docs.

@johnnychen94 johnnychen94 changed the base branch from master to jc/fdiff_fix October 11, 2021 20:40
@johnnychen94 johnnychen94 changed the title WIP: add ajoint_fdiff and fdiv WIP: add lazy finite difference DiffView and fdiv Oct 11, 2021
@@ -1,3 +1,94 @@
abstract type BoundaryCondition end
struct Periodic <: BoundaryCondition end
Copy link
Member Author

@johnnychen94 johnnychen94 Oct 11, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I picked :periodic when I implemented fdiff in #11, and now I realize that imfilter uses "circular" for the same thing. Should I change this PR to use Circular()?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting! Alternatively, should we switch imfilter to "periodic"? I suspect I may have gotten "circular" via my experience using Matlab, but the me of today seems to think that "periodic" is the more obvious choice. Thoughts?

Copy link
Member

@timholy timholy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks fantastic. It does seem to blur the line a bit between ImageFiltering and ImageBase, but I think this is a reasonable choice. We could even consider moving Laplacian here I suppose.

I've just left a couple of comments about potential ways to improve performance; everything else is elegant and well designed.

src/diff.jl Outdated
Comment on lines 56 to 58
i == A.dims || return p
p == first(r) && return last(r)
p - 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has two branches in it per dimension, and I would bet this is the origin of the performance problems. (Branches are bad on their own and also block @simd.) If you can write this out in ternary cond ? val1 : (cond2 ? val2 : val3) form I bet you'll get substantially better performance. Ternaries get converted to LLVM select statements which are non-branching and SIMD-compatible. (It's like ifelse, except usually the ternary form generates better code.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you've never seen this, it's a fun read.


The in-place version of [`fdiv`](@ref).
"""
function fdiv!(dst::AbstractArray, Vs::AbstractArray...)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For performance, it seems possible you'll want to force specialize this with Vs::Vararg{<:AbstractArray,N}. Only the loop that computes sum is sensitive to this, so alternatively you could split that out and force-specialize just that part. But I'm not sure it's bad (i.e., unnecessarily latency-inducing) to specialize the whole thing (it's pretty short).

Base automatically changed from jc/fdiff_fix to master October 12, 2021 12:06
@johnnychen94
Copy link
Member Author

johnnychen94 commented Oct 12, 2021

I noticed that we can move a large number of if-checks to compilation time, and after applying generated functions, I get ~6x faster on the Laplacian operator.
And I have to admit it's quite surprising that it's even faster than the current imfilter-based version:

X = rand(Float32, 256, 256);
laplacian(X)  ref_laplacian(X) # true
@btime laplacian($X); # 108.335 μs (2 allocations: 256.05 KiB)
@btime ref_laplacian($X); # 319.856 μs (17 allocations: 521.05 KiB)

X = rand(Float32, 1024, 1024);
laplacian(X)  ref_laplacian(X) # true
@btime laplacian($X); # 1.792 ms (2 allocations: 4.00 MiB)
@btime ref_laplacian($X); # 6.483 ms (17 allocations: 8.03 MiB)

I'm not very convinced about the implementation, though...

Edit: Urghh... after I pushed the new commits and start a new Julia session, I no longer get the performance boost.. 😢

Maintaining old compatibility becomes quite troublesome especially when we have
1.6 as the new LTS version now.
Also add `laplacian` and `laplacian!`
@johnnychen94
Copy link
Member Author

johnnychen94 commented Oct 12, 2021

This lazy view strategy is perhaps more complicated than I thought it was. IIUC the reason that fdiff is faster than DiffView is mainly due to the fact that fdiff eagerly handles the boundaries so that in the inner region, it doesn't need to do the extra boundary check, and thus calls efficient broadcasting implementation. In the contrast, DiffView checks if given position I is on the boundary, and it makes generating @simd codes more difficult than I expect.

I also tried the cond ? val1 : (cond2 ? val2 : val3) way, but they generate the same llvm codes.

I plan to pause this PR for a while and take the plain solution in https://github.com/JuliaImages/Images.jl/blob/f50f59b68895debc6882eebc70f70ded62befdf3/src/algorithms.jl#L505-L537 in a separate PR.

@timholy
Copy link
Member

timholy commented Oct 16, 2021

No need to pick this up again, but if/when you or someone else does: one thought I had is whether the "wrap-around" boundary condition was to blame for the performance gap. That seems worth testing specifically, perhaps by temporarily disabling it and seeing if you get much better performance. If that proves true, then I wonder if it would help to add something that might trick LLVM into hoisting it. I wonder if adding a field for v = view(data, _the_slice_at_the_array_edge_along_dim...) and handle all wraparound behavior through v?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

implement a lazy ImageBase.fdiff array?
2 participants