idempotency optimization on initialized outputs #353

Open
kylebd99 opened this issue Dec 13, 2023 · 6 comments
@kylebd99
Collaborator

In my project, I'm using the instance AST to programmatically build a Finch kernel. But I've been running into an issue where it sometimes needlessly densifies fibers. The following code reproduces the issue: output_fiber comes out fully dense even though it should just be a copy of X, which is sparse. I'm probably just not building the AST correctly, but it seems like an issue of Finch not recognizing the idempotency of overwrite.

using Finch
using Finch.FinchNotation: index_instance, tag_instance, variable_instance, literal_instance,
                            Reader, Updater, access_instance, assign_instance, loop_instance,
                            Dimensionless, declare_instance, block_instance

function test_fill()
    # Sparse 1000x1000 input with ~50% of entries stored
    X = fsprand((1000, 1000), .5)
    indices = [index_instance(:i), index_instance(:j)]
    # Read access X[i, j]
    X_tag = tag_instance(variable_instance(:X), X)
    X_access = access_instance(X_tag, literal_instance(Reader()), indices...)
    X_instance = X_access
    # Sparse output, updated with the overwrite operator
    output_fiber = Fiber(SparseHashLevel{1}(SparseHashLevel{1}(Element(0.0))))
    output_tag = tag_instance(variable_instance(:output_fiber), output_fiber)
    output_access = access_instance(output_tag, literal_instance(Updater()), indices...)
    full_prgm = assign_instance(output_access, overwrite, X_instance)

    # Wrap the assignment in loops over i and j
    for index in indices
        full_prgm = loop_instance(index, Dimensionless(), full_prgm)
    end

    # Declare (initialize) the output to 0.0 before the loops
    initializer = declare_instance(tag_instance(variable_instance(:output_fiber), output_fiber), literal_instance(0.0))
    full_prgm = block_instance(initializer, full_prgm)

    output_fiber = Finch.execute(full_prgm).output_fiber
    println("Input Size: ", countstored(X))
    println("Output Size: ", countstored(output_fiber))
    return output_fiber
end

test_fill()
@willow-ahrens
Collaborator

Okay, I took a look:

The first thing to say is that overwrite(a, b) = b, which has no identity. initwrite(z)(a, b) = (b == z ? a : b), so z is an identity of initwrite(z). Finch substitutes x[i] = y with x[i] <<initwrite(default(x))>>= y by default. In a sense, = asserts that writing the default to a tensor has no effect. You can get the kernel to have the expected behavior by using initwrite(0.0).
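For concreteness, a minimal sketch of that change against the reproducer above, assuming the same instance-AST setup and that initwrite can be referenced the same way overwrite is there (whether the operator needs to be wrapped in literal_instance, as in the reducing kernel below, may depend on the Finch version):

# Sketch: pass initwrite(0.0) instead of overwrite, so that writing the
# output's default (0.0) is a no-op and the output can stay sparse.
full_prgm = assign_instance(output_access, literal_instance(initwrite(0.0)), X_instance)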

A similar issue arises in this non-reducing kernel when we use max instead of overwrite. Finch doesn't think 0 is an identity of max, so it doesn't simplify the update. If the right-hand side had a default of -Inf, this kernel would work as expected.
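A tiny worked example of why 0 is not an identity of max: updating a running value with a default of 0 changes the result whenever the running value is negative, so those updates cannot be skipped.

max(-3.0, 0.0)   # == 0.0, not -3.0, so updating with a default of 0 is not a no-op
max(-3.0, -Inf)  # == -3.0; -Inf is the true identity of max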

Ultimately, these behaviors are a result of deciding not to reason, ever, about the current value held in a tensor. We don't do constant propagation through arrays, and we don't even try to reason about whether something is the first write to a tensor after initialization.

@willow-ahrens
Collaborator

willow-ahrens commented Dec 14, 2023

A similar issue arises during reduction: it's difficult to apply idempotent simplifications if the idempotent reduction is not in the innermost loop.

Consider the reducing kernel:

function test_fill()
    X = Fiber!(Dense(Dense(SparseList(Element(0.0)))), fsprand((100, 100, 100), .001))
    indices = [index_instance(:i), index_instance(:j), index_instance(:k)]
    X_tag = tag_instance(variable_instance(:X), X)
    X_access = access_instance(X_tag, literal_instance(Reader()), indices...)
    X_instance = X_access
    output_fiber = Fiber(SparseHashLevel{1}(SparseHashLevel{1}(Element(-Inf))))
    output_tag = tag_instance(variable_instance(:output_fiber), output_fiber)
    output_access = access_instance(output_tag, literal_instance(Updater()), indices[2], indices[3])
    full_prgm = assign_instance(output_access, literal_instance(max), X_instance)

    for index in indices
        full_prgm = loop_instance(index, Dimensionless(), full_prgm)
    end

    initializer = declare_instance(tag_instance(variable_instance(:output_fiber), output_fiber), literal_instance(-Inf))
    full_prgm = block_instance(initializer, full_prgm)

    output_fiber = Finch.execute(full_prgm).output_fiber
    println("Input Size: ", countstored(X))
    println("Output Size: ", countstored(output_fiber))
    Finch.execute_code(:prgm, typeof(full_prgm)) |> Finch.pretty |> Finch.dataflow
end

The above code will always densify the output because Finch doesn't know whether a negative number has previously been written to the output entry. Also, we can only apply an idempotent simplification if the reduction is in the innermost loop.

For example, if we do:

for i = _
   for j = _
      x[i] <<max>>= y[j, i]
   end
end

then we can eliminate contiguous maxes with 0 in the innermost loop.

However, in

for i = _
   for j = _
      x[j] <<max>>= y[j, i]
   end
end

each update in the inner loop modifies a single, separate entry of x. No idempotent simplification can be applied here, because we can't reason about the current value of that entry of x or whether it has already been maxed with 0, so we have to do all the updates just to be safe.

I think there is likely no way to get Finch to apply the idempotent simplification unless the reduction is in the innermost loop.

However, I do think you could keep from densifying the output by defining an operation analogous to initwrite for max. You could define initmax(z)(a, b) = (b <= z ? a : max(a, b)), which you could correctly declare to have the property isidentity(initmax(z), z) = true.
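A minimal sketch of what such a helper could look like in plain Julia (initmax is the proposed definition from above; it is not an existing Finch function):

# Hypothetical max analogue of initwrite: z is an identity of the returned operator,
# because updates that do not exceed the fill value z leave the stored value untouched.
initmax(z) = (a, b) -> b <= z ? a : max(a, b)

op = initmax(-Inf)
op(3.0, -Inf)  # == 3.0, updating with the fill value is a no-op
op(3.0, 5.0)   # == 5.0, otherwise behaves like max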

@willow-ahrens
Collaborator

As far as resolving this issue goes, what do you think we need to do here? Some options:

  • Describe this in the docs
  • Issue a warning when the output is getting densified
  • Issue a warning when the operations are densified

On the latter two points, I think we should just ask the user whether they intended the kernel to be sparse. However, I'm open to suggestions about how to detect that it became dense @wraith1995 @nullplay.

@kylebd99
Collaborator Author

Thanks for looking into this and for the detailed explanation. The behavior makes a lot more sense now.

This may be a bit of a pain or not feasible in the rewrite infrastructure, but I think that reasoning about the initialization value of x would be enough to capture some of the big cases. If x is initialized with default(y) and f is abelian (which seems like a fairly common case), then the idempotency optimization should apply to arbitrary loop nests. Under these assumptions, I think that

x .= default(y) 
for i = _
   for j = _
      x[i] <<f>>= y[j, i]
   end
end

is equivalent to

x .= default(y) 
for i = _
   for j = _
      x[i] <<f>>= f(y[j, i], default(y))
   end
end

So we can always assume that x[i] has been "maxed with the default value". This requires analyzing initialization values, so I'm not sure if it's worth it per se. But I think it avoids reasoning about "current values"?
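A tiny numeric check of this equivalence, under the stated assumptions (x initialized to default(y), f commutative and associative; here f = max and default(y) = 0.0 for concreteness, and the variables are ordinary Julia arrays rather than Finch tensors):

f = max
d = 0.0                              # default(y)
y = [0.0 2.0; 3.0 0.0]
x1 = fill(d, 2); x2 = fill(d, 2)
for i in 1:2, j in 1:2
    x1[i] = f(x1[i], y[j, i])        # original update
    x2[i] = f(x2[i], f(y[j, i], d))  # update pre-folded with the default
end
x1 == x2                             # true: folding in the default changes nothing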

On warnings, I lean towards more warnings whenever the performance is big-O worse than might be reasonably expected, especially because the user can always turn on fast mode to remove them. So, whatever version of these warnings is feasible would be helpful. To avoid over-warning for dense kernels, you could just detect when the inputs are all dense and screen warnings out in this case.

On the docs front, this seems like a good addition to the performance section. In general, a small section on operations and situations that result in densification might be useful, similar to the section on concordant iteration. Also, a small note: the docs on assign here are where I picked up the overwrite function. It might be helpful to describe initwrite there and explain that it results in simple assignment.

Thanks again!

@willow-ahrens willow-ahrens changed the title Manual Program Instance Causing Densification idempotency optimization on outputs Dec 20, 2023
@willow-ahrens willow-ahrens changed the title idempotency optimization on outputs idempotency optimization on initialized outputs Dec 20, 2023
@willow-ahrens willow-ahrens added the priority Do this first label Dec 20, 2023
@wraith1995
Collaborator

wraith1995 commented Dec 21, 2023

Two cents on dense detection: if a sparse level is used, we could look at all of the indices used to write to it and then ask whether they correspond to loops in the generated code whose work is at least the size of that level. We could probably write a pretty naive way of computing the work that assigns while loops O(1) and for loops O(n).
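A purely illustrative sketch of that naive cost model, under my reading of the proposal; the node types and the work function below are hypothetical stand-ins, not Finch's actual generated-code IR:

abstract type Node end
struct ForLoop   <: Node; extent::Int; body::Node; end  # iteration count treated as O(n)
struct WhileLoop <: Node; body::Node; end               # iteration count treated as O(1)
struct Leaf      <: Node; end

work(n::ForLoop)   = n.extent * work(n.body)
work(n::WhileLoop) = work(n.body)
work(n::Leaf)      = 1

# Two dense for-loops of extent 1000 get an estimate of 1_000_000, while a
# while-loop over stored values would count as O(1) per level of nesting.
work(ForLoop(1000, ForLoop(1000, Leaf())))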

@willow-ahrens
Collaborator

At the very least, if all the inputs are sparse, we could replace them with their default and double check that the code is a no-op?

We might also run the same check when any input is sparse, rather than just when they all are.

What happens when we put two statements in a loop, and one of the statements is dense while the other is sparse?

Perhaps the appropriate check is to simply have @finch_checksparse that replaces all of the sparse levels with their defaults and then prints the simplified code, and leave it to the user to interpret what went wrong, if anything.

@willow-ahrens willow-ahrens added this to the v0.7.0 milestone Feb 27, 2024
@willow-ahrens willow-ahrens self-assigned this Apr 22, 2024
@willow-ahrens willow-ahrens removed the priority Do this first label Jun 12, 2024
@willow-ahrens willow-ahrens modified the milestones: v0.7.0, v1.0 Jun 17, 2024