
Refactor following theory #57

Closed · 2 tasks done

gdalle opened this issue May 9, 2024 · 9 comments

gdalle commented May 9, 2024


Dumping the code from our meeting

# Placeholder tracer types; the sparsity patterns would be stored as
# sparse binary data, e.g. (::AbstractSparseVector, ::AbstractSparseVector)
struct GlobalGradientTracer end
struct GlobalHessianTracer end
struct LocalGradientTracer end

# Global classification: can ∂max/∂(first argument) ever be nonzero?
firstder_firstarg_nonzero(::typeof(max)) = true
# Local classification at the primal values: ∂max(α, β)/∂α ≠ 0 iff α > β
firstder_firstarg_nonzero(::typeof(max), α, β) = α > β

for op in all_the_ops
    @eval begin
        function $op(t1::T, t2::T) where {T<:GlobalGradientTracer}
            gα = gradient_sparsity(t1)
            gβ = gradient_sparsity(t2)
            a = firstder_firstarg_nonzero($op)
            b = firstder_secondarg_nonzero($op)
            if a && b
                return # tracer with gα ∪ gβ
            elseif a
                return # tracer with gα
            elseif b
                return # tracer with gβ
            else
                return nothing # empty
            end
        end

        function $op(t1::T, t2::T) where {T<:GlobalHessianTracer}
            gα = gradient_sparsity(t1)
            gβ = gradient_sparsity(t2)
            Hα = hessian_sparsity(t1)
            Hβ = hessian_sparsity(t2)
            a = firstder_firstarg_nonzero($op)
            b = firstder_secondarg_nonzero($op)
            c = seconder_firstarg_nonzero($op)
            d = nothing # TODO
            e = nothing # TODO
            # more stuff
        end

        function $op(t1::T, t2::T) where {T<:LocalGradientTracer}
            α = primal(t1)
            β = primal(t2)
            gα = gradient_sparsity(t1)
            gβ = gradient_sparsity(t2)
            a = firstder_firstarg_nonzero($op, α, β)
            b = firstder_secondarg_nonzero($op, α, β)
            if a && b
                return # tracer with gα ∪ gβ
            elseif a
                return # tracer with gα
            elseif b
                return # tracer with gβ
            else
                return nothing # empty
            end
        end
    end
end

adrhill commented May 10, 2024

One key insight from the ongoing refactor in #59 is that we previously used dicts of sets to represent both hessian_sparsity (key => non-empty set) and gradient_sparsity (key => empty set).

Introducing a GlobalHessianTracer with two parametric types

struct GlobalHessianTracer{G,H} <: AbstractHessianTracer
    grad::G     # sparse binary vector representation of non-zero entries in the gradient
    hessian::H  # sparse binary matrix representation of non-zero entries in the Hessian
end

therefore isn't flexible enough to support our previous approach.

gdalle commented May 10, 2024

I'm still not convinced a single dict can fully represent the gradient tracer as well. Aren't there cases where the gradient tracer has more nonzero coordinates than the hessian tracer?

gdalle commented May 10, 2024

Typically, a linear function.
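
For a concrete (hypothetical) illustration, not taken from the package itself: a linear function has a non-empty gradient sparsity pattern but an empty Hessian sparsity pattern.

# f is linear: the gradient pattern is {1, 3},
# but the Hessian is identically zero, so the Hessian pattern is empty.
f(x) = 2x[1] + x[3]

gradient_pattern = Set([1, 3])            # indices i with ∂f/∂xᵢ ≠ 0
hessian_pattern  = Set{Tuple{Int,Int}}()  # no pairs (i, j) with ∂²f/∂xᵢ∂xⱼ ≠ 0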

adrhill commented May 10, 2024

Aren't there cases where the gradient tracer has more nonzero coordinates than the hessian tracer?

Right, those were represented as empty sets. To give you an example:

Dict(
  1 => Set(),        # <-- empty sets hold only gradient information
  3 => Set([4, 5]),
  6 => Set([6]),
)

represents the first-order information (all other $\frac{\partial f}{\partial x_i}=0$)

  • $\frac{\partial f}{\partial x_1} \neq 0$
  • $\frac{\partial f}{\partial x_3} \neq 0$
  • $\frac{\partial f}{\partial x_6} \neq 0$

and the second-order information (except for permutations of the cases below, all other $\frac{\partial^2 f}{\partial x_i \partial x_j} = 0$)

  • $\frac{\partial^2 f}{\partial x_3 \partial x_4} \neq 0$
  • $\frac{\partial^2 f}{\partial x_3 \partial x_5} \neq 0$
  • $\frac{\partial^2 f}{\partial x_6^2} \neq 0$
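
To make the convention explicit, here is a small decoding sketch (illustrative helper names, not the package's API):

# Dict-of-sets pattern from the example above
pattern = Dict(
    1 => Set{Int}(),
    3 => Set([4, 5]),
    6 => Set([6]),
)

# Gradient sparsity: every key i corresponds to ∂f/∂xᵢ ≠ 0
grad_indices(d) = Set(keys(d))

# Hessian sparsity: every pair (i, j) with j in the set stored at key i
hess_indices(d) = Set((i, j) for (i, js) in d for j in js)

grad_indices(pattern)  # Set([1, 3, 6])
hess_indices(pattern)  # Set([(3, 4), (3, 5), (6, 6)])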

gdalle commented May 10, 2024

Right, I had forgotten that part. My take is that we should do the implementation that sticks to the theory, and if we really lose performance we can think about reintroducing this trick. Presumably the Hessian tracing is much more expensive than the gradient tracing anyway, so it doesn't add much cost to carry a gradient tracer around (plus it is necessary if you use the set-of-pairs representation, which I'm still convinced is the right one).
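
A minimal sketch of that set-of-pairs representation, with hypothetical field names (the point being that the gradient pattern is carried alongside the Hessian pairs):

# Hypothetical: Hessian sparsity stored as a set of index pairs (i, j),
# with the gradient sparsity carried along as a separate set of indices.
struct PairsHessianPattern
    grad::Set{Int}             # indices i with ∂f/∂xᵢ ≠ 0
    hess::Set{Tuple{Int,Int}}  # pairs (i, j) with ∂²f/∂xᵢ∂xⱼ ≠ 0
end

# Same information as the dict example above, in this representation:
PairsHessianPattern(Set([1, 3, 6]), Set([(3, 4), (3, 5), (6, 6)]))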

adrhill commented May 10, 2024

This is the option I also tend towards.

An alternative would be to introduce wrapper structs for "first- and second-order information" (sketched below):

  • one for generic, separate first- and second-order information
  • one for this dict-based approach
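
A rough sketch of what those two wrappers might look like (hypothetical names, just to make the alternative concrete):

abstract type AbstractSecondOrderInfo end

# Wrapper 1: generic, separate first- and second-order information
struct SeparateInfo{G,H} <: AbstractSecondOrderInfo
    grad::G     # any gradient pattern representation
    hessian::H  # any Hessian pattern representation
end

# Wrapper 2: the dict-based approach, where the keys carry the gradient
# pattern and the sets stored as values carry the Hessian pattern
struct DictInfo{D<:AbstractDict} <: AbstractSecondOrderInfo
    d::D
end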

gdalle commented May 10, 2024

We don't have time for fancy schmancy stuff. Let's do the clean way that we are sure works, and after NeurIPS we can bikeshed such details.

gdalle commented May 10, 2024

Plus, now that we have CI benchmarks, we'll know if there's a really bad regression!

adrhill commented May 14, 2024

Task 1 was completed in #59. Closing in favor of the more specific #56.
