
Performance of Matrix Chain Multiplication in Python #669

Closed
mtsokol opened this issue Dec 11, 2024 · 19 comments

@mtsokol
Member

mtsokol commented Dec 11, 2024

Hi @willow-ahrens @kylebd99,

I wanted to discuss the Matrix Chain Multiplication Python example that I'm working on.
The Python implementation relies on lazy indexing, as tensordot is slow for other examples, like SDDMM.
Here's the Python implementation:

@sparse.compiled(opt=sparse.DefaultScheduler())
def matrix_mul_finch(a, b, s):
    return sparse.sum(a[:, None, :, None] * sparse.permute_dims(b, (1, 0))[None, :, :, None] * s[None, :, None, :], axis=(-2, -3))

But it looks like for chain multiplication it's also way slower compared to SciPy or Numba.
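
For reference, here's roughly the SciPy baseline it's compared against (a sketch, not the exact benchmark code; it assumes s is available as a SciPy CSR array):

import scipy.sparse as sps

def matrix_mul_scipy(a, b, s_csr):
    # Chain product a @ b @ s: dense @ dense first, then dense @ sparse.
    return (a @ b) @ s_csr

# e.g. s_csr = sps.random(LEN, LEN, density=DENSITY, format="csr")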
Here's a Julia snippet that reproduces it, give or take:

using Finch

LEN = 250;
a_raw = rand(LEN, LEN - 5);
b_raw = rand(LEN, LEN - 5);
c_raw = fsprand(Float64, LEN, LEN, 0.01);

a = lazy(swizzle(Tensor(a_raw), 1, 2));
b = lazy(swizzle(Tensor(b_raw), 1, 2));
c = lazy(swizzle(c_raw, 1, 2));

plan = sum(broadcast(*, broadcast(*, a[:, nothing, :, nothing], b[nothing, :, :, nothing]), c[nothing, :, nothing, :]), dims=(2, 3));

result = compute(plan);

@assert sum(broadcast(abs, Array(result) - (a_raw * transpose(b_raw) * Array(c_raw)))) < 1e-5

It looks like in each case the output is densified instead of staying sparse.

With verbose=true and the default scheduler, I get these logs:

:(function var"##compute#293"(prgm)
      begin
          V = ((((((((((((((((((((prgm.children[1]).children[2]).children[2]).children[3]).children[1]).children[2]).children[1]).children[2]).children[1]).children[1]).children[2]).children[1]).children[2]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).tns.val::Tensor{DenseLevel{Int64, DenseLevel{Int64, ElementLevel{0.0, Float64, Int64, Vector{Float64}}}}}
          V_2 = ((((((((((((((((((((prgm.children[1]).children[2]).children[2]).children[3]).children[1]).children[2]).children[1]).children[2]).children[1]).children[1]).children[2]).children[1]).children[3]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).tns.val::Tensor{DenseLevel{Int64, DenseLevel{Int64, ElementLevel{0.0, Float64, Int64, Vector{Float64}}}}}
          V_3 = (((((((((((((((prgm.children[1]).children[2]).children[2]).children[3]).children[1]).children[2]).children[1]).children[3]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).tns.val::Tensor{SparseCOOLevel{2, Tuple{Int64, Int64}, Vector{Int64}, Tuple{Vector{Int64}, Vector{Int64}}, ElementLevel{0.0, Float64, Int64, Vector{Float64}}}}
          A0 = V::Tensor{DenseLevel{Int64, DenseLevel{Int64, ElementLevel{0.0, Float64, Int64, Vector{Float64}}}}}
          A1 = V_2::Tensor{DenseLevel{Int64, DenseLevel{Int64, ElementLevel{0.0, Float64, Int64, Vector{Float64}}}}}
          A2 = V_3::Tensor{SparseCOOLevel{2, Tuple{Int64, Int64}, Vector{Int64}, Tuple{Vector{Int64}, Vector{Int64}}, ElementLevel{0.0, Float64, Int64, Vector{Float64}}}}
          A3 = Tensor(Dense(SparseDict(Element{0.0, Float64}())))::Tensor{DenseLevel{Int64, SparseDictLevel{Int64, Vector{Int64}, Vector{Int64}, Vector{Int64}, Dict{Tuple{Int64, Int64}, Int64}, Vector{Int64}, ElementLevel{0.0, Float64, Int64, Vector{Float64}}}}}
          @finch mode = :fast begin
                  A3 .= 0.0
                  for i1 = _
                      for i0 = _
                          for i3 = _
                              for i2 = _
                                  A3[i3, i0] << + >>= (*)((*)(A0[i0, i1], A1[i2, i1]), A2[i2, i3])
                              end
                          end
                      end
                  end
                  return A3
              end
          return (swizzle(A3, 2, 1),)
      end
  end)

With verbose=true and the Galley scheduler, I get these logs:

Input Queries :
Query(Alias(##A#400),
   Materialize(Value(t_undef),Value(t_undef),##i#275,##i#278,
      Aggregate(Value(+),Value(0.0),##i#276,##i#277,
         MapJoin(Value(*),
            MapJoin(Value(*),Input(Value(FIBER),##i#275,##i#277),Input(Value(FIBER),##i#276,##i#277)),Input(Value(FIBER),##i#276,##i#278)))))
FAQ Opt Time: 0.0069370269775390625
--------------- Logical Plan ---------------
Query(Alias(A_37),
   Aggregate(Value(+),Value(0.0),##i#276,
      MapJoin(Value(*),Input(Value(FIBER),##i#276,##i#277),Input(Value(FIBER),##i#276,##i#278))))
Query(##A#400,
   Materialize(Value(t_undef),Value(t_undef),##i#275,##i#278,
      Aggregate(Value(+),Value(0.0),##i#277,
         MapJoin(Value(*),Input(Value(FIBER),##i#275,##i#277),Alias(A_37)))))
--------------------------------------------
Physical Opt Time: 0.0069370269775390625
--------------- Physical Plan ---------------
Query(Alias(A_39,##i#276,##i#277),
   Materialize(Value(t_dense),Value(t_dense),##i#277,##i#276,
      Aggregate(Value(Finch.FinchNotation.InitWriter{0.0}()),Value(0.0),Input(##i#276::t_default,##i#277::t_default))),##i#277,##i#276)
Query(Alias(A_37,##i#278,##i#277),
   Materialize(Value(t_dense),Value(t_dense),##i#278,##i#277,
      Aggregate(Value(+),Value(0.0),##i#276,
         MapJoin(Value(*),Alias(A_39,##i#277::t_default,##i#276::t_follow),Input(##i#276::t_default,##i#278::t_default)))),##i#278,##i#276,##i#277)
Query(##A#400,
   Materialize(Value(t_dense),Value(t_dense),##i#275,##i#278,
      Aggregate(Value(+),Value(0.0),##i#277,
         MapJoin(Value(*),Input(##i#275::t_default,##i#277::t_default),Alias(A_37,##i#278::t_default,##i#277::t_follow)))),##i#277,##i#278,##i#275)
--------------------------------------------
Expected Output Size: 61250
Expected Output Size: 57085
Expected Output Size: 58250
Executing:
:(function var"##compute#401"(prgm)
      begin
          V = ((((((((((((((((((((prgm.children[1]).children[2]).children[2]).children[3]).children[1]).children[2]).children[1]).children[2]).children[1]).children[1]).children[2]).children[1]).children[2]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).tns.val::Tensor{DenseLevel{Int64, DenseLevel{Int64, ElementLevel{0.0, Float64, Int64, Vector{Float64}}}}}
          V_2 = ((((((((((((((((((((prgm.children[1]).children[2]).children[2]).children[3]).children[1]).children[2]).children[1]).children[2]).children[1]).children[1]).children[2]).children[1]).children[3]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).tns.val::Tensor{DenseLevel{Int64, DenseLevel{Int64, ElementLevel{0.0, Float64, Int64, Vector{Float64}}}}}
          V_3 = (((((((((((((((prgm.children[1]).children[2]).children[2]).children[3]).children[1]).children[2]).children[1]).children[3]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).tns.val::Tensor{SparseCOOLevel{2, Tuple{Int64, Int64}, Vector{Int64}, Tuple{Vector{Int64}, Vector{Int64}}, ElementLevel{0.0, Float64, Int64, Vector{Float64}}}}
          _A_1 = Tensor(Dense{Int32}(Dense{Int32}(Element{0.0, Float64, Int64}(Float64[]), 250), 250))
          A_37 = Tensor(Dense{Int32}(Dense{Int32}(Element{0.0, Float64, Int64}(Float64[]), 250), 245))
          A_39 = Tensor(Dense{Int32}(Dense{Int32}(Element{0.0, Float64, Int64}(Float64[]), 245), 250))
          ()
          @finch mode = :safe begin
                  A_39 .= 0.0
                  for _i_2 = _
                      for _i_3 = _
                          A_39[_i_2, _i_3] << Finch.FinchNotation.InitWriter{0.0}() >>= V_2[_i_3, _i_2]
                      end
                  end
              end
          @finch mode = :safe begin
                  A_37 .= 0.0
                  for _i_4 = _
                      for _i_3 = _
                          for _i_2 = _
                              A_37[_i_4, _i_2] << + >>= (*)(V_3[_i_3, _i_4], A_39[_i_2, (follow)(_i_3)])
                          end
                      end
                  end
              end
          @finch mode = :safe begin
                  _A_1 .= 0.0
                  for _i_2 = _
                      for _i_4 = _
                          for _i_5 = _
                              _A_1[_i_5, _i_4] << + >>= (*)(A_37[_i_4, (follow)(_i_2)], V[_i_5, _i_2])
                          end
                      end
                  end
              end
          return Tuple([_A_1])
      end
  end)
┌ Warning: Performance Warning: non-concordant traversal of A_39[_i_2, _i_3] (hint: most arrays prefer column major or first index fast, run in fast mode to ignore this warning)
└ @ Finch ?:0
┌ Warning: Performance Warning: non-concordant traversal of A_37[_i_4, _i_6] (hint: most arrays prefer column major or first index fast, run in fast mode to ignore this warning)
└ @ Finch ?:0
@willow-ahrens
Collaborator

Important note: using broadcast and .* is redundant. You can broadcast(*, A, B) or you can A .* B, but broadcast(.*, A, B) says to broadcast the broadcast.

@mtsokol
Member Author

mtsokol commented Dec 11, 2024

Important note: using broadcast and .* is redundant. You can broadcast(*, A, B) or you can A .* B, but broadcast(.*, A, B) says to broadcast the broadcast.

Ah right, I updated the snippet and logs to use broadcast(*, A, B). The performance question still applies, as from Python we always call broadcast(*, ...).

@willow-ahrens
Collaborator

One note on using Galley: it optimizes for the first inputs you give it. Are you testing it with multiple inputs?

@mtsokol
Member Author

mtsokol commented Dec 11, 2024

I'm benchmarking it for one set of input tensors per configuration. I have a parametrization grid, e.g.:

configs = [
    {"LEN": 5000, "DENSITY": 0.00001},
    {"LEN": 10000, "DENSITY": 0.00001},
    {"LEN": 20000, "DENSITY": 0.00001},
    {"LEN": 25000, "DENSITY": 0.00001},
    {"LEN": 30000, "DENSITY": 0.00001},
]

and for each of them I construct input tensors, create the benchmarked functions (Finch, Galley, etc.), precompile them, and benchmark them.
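
Roughly like this, just a sketch of the structure (names and the timing harness here are placeholders, not the actual benchmark code):

import time

import numpy as np
import sparse

def run_config(LEN, DENSITY, repeat=5):
    # One set of inputs per configuration; conversion details are simplified here.
    a = sparse.asarray(np.random.rand(LEN, LEN - 5))
    b = sparse.asarray(np.random.rand(LEN - 5, LEN))
    s = sparse.random((LEN, LEN), density=DENSITY)

    matrix_mul_finch(a, b, s)  # warm-up call doubles as precompilation
    start = time.perf_counter()
    for _ in range(repeat):
        matrix_mul_finch(a, b, s)
    return (time.perf_counter() - start) / repeat

results = {cfg["LEN"]: run_config(cfg["LEN"], cfg["DENSITY"]) for cfg in configs}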

@willow-ahrens
Collaborator

I see. Could you try giving Galley a new tag for each of them? I think you can set the tag parameter as a kwarg.
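
Something like this from the Python side, maybe (just a sketch; I'm assuming the Galley scheduler is exposed as sparse.GalleyScheduler and that the compiled decorator forwards a tag keyword):

# Hypothetical: one tag per configuration so Galley plans for each input size separately.
@sparse.compiled(opt=sparse.GalleyScheduler(), tag=f"chain_mul_len_{LEN}")
def matrix_mul_galley(a, b, s):
    return sparse.sum(
        a[:, None, :, None]
        * sparse.permute_dims(b, (1, 0))[None, :, :, None]
        * s[None, :, None, :],
        axis=(-2, -3),
    )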

@mtsokol
Member Author

mtsokol commented Dec 11, 2024

Sure, I also tried with the tag set; it's pretty much the same. Here's the plot I'm getting for the Python example execution:

[plot: benchmark results for the Python example]

And verbose mode from the Python layer shows that in both cases (default and Galley scheduler) the output is densified:

:(function var"##compute#538"(prgm)
      begin
          V = ((((((((((((((((((((((((((((prgm.children[1]).children[2]).children[2]).children[3]).children[1]).children[1]).children[1]).children[2]).children[1]).children[2]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).children[1]).children[2]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).tns.val::Tensor{DenseLevel{Int64, DenseLevel{Int64, ElementLevel{0.0, Float64, Int64, PyArray{Float64, 1, true, true, Float64}}}}}
          V_2 = ((((((((((((((((((((((((((((((prgm.children[1]).children[2]).children[2]).children[3]).children[1]).children[1]).children[1]).children[2]).children[1]).children[2]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).children[1]).children[3]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).tns.val::Tensor{DenseLevel{Int64, DenseLevel{Int64, ElementLevel{0.0, Float64, Int64, PyArray{Float64, 1, true, true, Float64}}}}}
          V_3 = (((((((((((((((((((prgm.children[1]).children[2]).children[2]).children[3]).children[1]).children[1]).children[1]).children[2]).children[1]).children[3]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).tns.val::Tensor{DenseLevel{Int64, SparseListLevel{Int64, PlusOneVector{Int32, PyArray{Int32, 1, true, true, Int32}}, PlusOneVector{Int32, PyArray{Int32, 1, true, true, Int32}}, ElementLevel{0.0, Float64, Int64, PyArray{Float64, 1, true, true, Float64}}}}}
          A0 = V::Tensor{DenseLevel{Int64, DenseLevel{Int64, ElementLevel{0.0, Float64, Int64, PyArray{Float64, 1, true, true, Float64}}}}}
          A1 = V_2::Tensor{DenseLevel{Int64, DenseLevel{Int64, ElementLevel{0.0, Float64, Int64, PyArray{Float64, 1, true, true, Float64}}}}}
          A2 = V_3::Tensor{DenseLevel{Int64, SparseListLevel{Int64, PlusOneVector{Int32, PyArray{Int32, 1, true, true, Int32}}, PlusOneVector{Int32, PyArray{Int32, 1, true, true, Int32}}, ElementLevel{0.0, Float64, Int64, PyArray{Float64, 1, true, true, Float64}}}}}
          A3 = Tensor(Dense(Dense(Element{0.0, Float64}())))::Tensor{DenseLevel{Int64, DenseLevel{Int64, ElementLevel{0.0, Float64, Int64, Vector{Float64}}}}}
          @finch mode = :fast begin
                  A3 .= 0.0
                  for i3 = _
                      for i1 = _
                          for i2 = _
                              for i0 = _
                                  A3[i0, i3] << + >>= (*)((*)(A0[i0, i1], A1[i2, i1]), A2[i2, i3])
                              end
                          end
                      end
                  end
                  return A3
              end
          return (A3,)
      end
  end)

@willow-ahrens
Collaborator

Could you paste the benchmark code somewhere?

@mtsokol
Member Author

mtsokol commented Dec 11, 2024

Here's a self-contained script: https://gist.github.com/mtsokol/5718e1e57b101b93c2b83d83112f56c0
It saves a plot.

@kylebd99
Collaborator

Thanks for the script. This is super helpful. I'm taking a look now.

@willow-ahrens
Collaborator

For what it's worth, on my machine, from the Julia interface, I can get Galley to use a sparse format:

julia> LEN = 500;

julia> a_raw = rand(LEN, LEN - 5);

julia> b_raw = rand(LEN, LEN - 5);

julia> c_raw = fsprand(Float64, LEN, LEN, 0.0001);

julia> a = lazy(swizzle(Tensor(a_raw), 1, 2));

julia> b = lazy(swizzle(Tensor(b_raw), 1, 2));

julia> c = lazy(swizzle(c_raw, 1, 2));

julia> plan = sum(broadcast(*, broadcast(*, a[:, nothing, :, nothing], b[nothing, :, :, nothing]), c[nothing, :, nothing, :]), dims=(2, 3));

julia> result = compute(plan, verbose=true, ctx=galley_scheduler());
Executing:
:(function var"##compute#362"(prgm)
      begin
          V = ((((((((((((((((((((prgm.children[1]).children[2]).children[2]).children[3]).children[1]).children[2]).children[1]).children[2]).children[1]).children[1]).children[2]).children[1]).children[2]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).tns.val::Tensor{DenseLevel{Int64, DenseLevel{Int64, ElementLevel{0.0, Float64, Int64, Vector{Float64}}}}}
          V_2 = ((((((((((((((((((((prgm.children[1]).children[2]).children[2]).children[3]).children[1]).children[2]).children[1]).children[2]).children[1]).children[1]).children[2]).children[1]).children[3]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).tns.val::Tensor{DenseLevel{Int64, DenseLevel{Int64, ElementLevel{0.0, Float64, Int64, Vector{Float64}}}}}
          V_3 = (((((((((((((((prgm.children[1]).children[2]).children[2]).children[3]).children[1]).children[2]).children[1]).children[3]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).tns.val::Tensor{SparseCOOLevel{2, Tuple{Int64, Int64}, Vector{Int64}, Tuple{Vector{Int64}, Vector{Int64}}, ElementLevel{0.0, Float64, Int64, Vector{Float64}}}}
          _A_1 = Tensor(Dense{Int32}(Dense{Int32}(Element{0.0, Float64, Int64}([50.909509953661015, 51.55506500817316, 49.06051924835156, 49.11894756080337, 49.87697390527504, 51.484390325465256, 52.922481331798764, 47.14373046494763, 50.98300988790281, 48.20091698143638  …  94.18366400484176, 89.98885364709577, 94.91692055697541, 100.8374202778461, 89.78504712431399, 90.54363416225185, 86.77978806148055, 88.02084781211785, 89.35515622101124, 85.80792030912862]), 250), 250))
          A_6 = Tensor(Dense{Int32}(Dense{Int32}(Element{0.0, Float64, Int64}([0.4546635288293994, 0.2660143798936316, 1.9663527207827898, 2.061613267816494, 0.06157434998014895, 0.6340272833405703, 0.04483174111819589, 0.9572180888714056, 1.209130133600966, 0.011503456800502358  …  0.01104175659091753, 0.43101164301046946, 0.0, 0.7050757022128256, 0.526619375142649, 1.4194965342488919, 0.07713399632958723, 1.806008201240464, 0.8964672343053504, 0.3577360296716781]), 250), 245))
          A_8 = Tensor(Dense{Int32}(Dense{Int32}(Element{0.0, Float64, Int64}([0.6544506386562216, 0.31612465185721306, 0.7217552753658356, 0.6769946283675499, 0.4319074661143846, 0.9789452340361754, 0.12421530550352888, 0.5644116806267045, 0.40386088070597703, 0.08543785205858079  …  0.9105396373084738, 0.5685989532103138, 0.07443162741339726, 0.14851716325285946, 0.7261718834208045, 0.39879687533605557, 0.3295145198985465, 0.05716585178202094, 0.07699274243448839, 0.8397302478662261]), 245), 250))
          ()
          @finch mode = :fast begin
                  A_8 .= 0.0
                  for _i_2 = _
                      for _i_3 = _
                          A_8[_i_2, _i_3] << Finch.FinchNotation.InitWriter{0.0}() >>= V_2[_i_3, _i_2]
                      end
                  end
              end
          @finch mode = :fast begin
                  A_6 .= 0.0
                  for _i_4 = _
                      for _i_3 = _
                          for _i_2 = _
                              A_6[_i_4, _i_2] << + >>= (*)(V_3[_i_3, _i_4], A_8[_i_2, (follow)(_i_3)])
                          end
                      end
                  end
              end
          @finch mode = :fast begin
                  _A_1 .= 0.0
                  for _i_2 = _
                      for _i_4 = _
                          for _i_5 = _
                              _A_1[_i_5, _i_4] << + >>= (*)(A_6[_i_4, (follow)(_i_2)], V[_i_5, _i_2])
                          end
                      end
                  end
              end
          return Tuple([_A_1])
      end
  end)

julia> result = compute(plan, verbose=true, ctx=galley_scheduler(), tag=:new);
Executing:
:(function var"##compute#447"(prgm)
      begin
          V = ((((((((((((((((((((prgm.children[1]).children[2]).children[2]).children[3]).children[1]).children[2]).children[1]).children[2]).children[1]).children[1]).children[2]).children[1]).children[2]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).tns.val::Tensor{DenseLevel{Int64, DenseLevel{Int64, ElementLevel{0.0, Float64, Int64, Vector{Float64}}}}}
          V_2 = ((((((((((((((((((((prgm.children[1]).children[2]).children[2]).children[3]).children[1]).children[2]).children[1]).children[2]).children[1]).children[1]).children[2]).children[1]).children[3]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).tns.val::Tensor{DenseLevel{Int64, DenseLevel{Int64, ElementLevel{0.0, Float64, Int64, Vector{Float64}}}}}
          V_3 = (((((((((((((((prgm.children[1]).children[2]).children[2]).children[3]).children[1]).children[2]).children[1]).children[3]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).tns.val::Tensor{SparseCOOLevel{2, Tuple{Int64, Int64}, Vector{Int64}, Tuple{Vector{Int64}, Vector{Int64}}, ElementLevel{0.0, Float64, Int64, Vector{Float64}}}}
          _A_1 = Tensor(Dense{Int32}(Dense{Int32}(Element{0.0, Float64, Int64}(Float64[]), 500), 500))
          A_14 = Tensor(Dense{Int32}(SparseByteMap{Int32}(Element{0.0, Float64, Int64}(Float64[]), 500, [1], Bool[], Tuple{Int64, Int32}[]), 495))
          A_16 = Tensor(Dense{Int32}(Dense{Int32}(Element{0.0, Float64, Int64}(Float64[]), 495), 500))
          ()
          @finch mode = :fast begin
                  A_16 .= 0.0
                  for _i_2 = _
                      for _i_3 = _
                          A_16[_i_2, _i_3] << Finch.FinchNotation.InitWriter{0.0}() >>= V_2[_i_3, _i_2]
                      end
                  end
              end
          @finch mode = :fast begin
                  A_14 .= 0.0
                  for _i_4 = _
                      for _i_3 = _
                          for _i_2 = _
                              A_14[_i_4, _i_2] << + >>= (*)(V_3[_i_3, _i_4], A_16[_i_2, (follow)(_i_3)])
                          end
                      end
                  end
              end
          @finch mode = :fast begin
                  _A_1 .= 0.0
                  for _i_2 = _
                      for _i_4 = _
                          for _i_5 = _
                              _A_1[_i_5, _i_4] << + >>= (*)(V[_i_5, _i_2], A_14[(walk)(_i_4), (follow)(_i_2)])
                          end
                      end
                  end
              end
          return Tuple([_A_1])
      end
  end)

@mtsokol
Member Author

mtsokol commented Dec 11, 2024

Another thing to discuss tomorrow: we can revisit tensordot and see if it makes sense to use it instead of lazy indexing (lazy indexing is rather temporary; we would like to be able to call a @ b and get decent performance).
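
For reference, the tensordot formulation of the same chain would look roughly like this (a sketch; whether the schedulers handle it well is exactly the question):

def matrix_mul_tensordot(a, b, s):
    # Two pairwise contractions instead of lazy indexing: (a @ b) @ s.
    ab = sparse.tensordot(a, b, axes=([1], [0]))
    return sparse.tensordot(ab, s, axes=([1], [0]))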

@willow-ahrens
Collaborator

They should be more similar now. Both the default Julia scheduler and the Galley scheduler have been adjusted to work better with tensordot.

@kylebd99
Collaborator

I'm a bit lost, what did you change?

@willow-ahrens
Collaborator

It used to be that we needed to do lazy indexing in order to fuse SDDMM in the default scheduler. Now that I have implemented backwards fusion, this is no longer necessary. It was never necessary with Galley.

@kylebd99
Collaborator

Okay, I spent maybe too much time thinking about this today. But I have a few findings.

1. Unless you set the following vars, both SciPy and Numba will run in parallel, giving them a significant boost (see the sketch after this list):

os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["NUMBA_NUM_THREADS"] = "1"

2. At the settings used, the dense matmul is very fast, so what we're seeing is primarily a difference in the speed of the dense computation. I think that Numba and SciPy are just a lot better at this, probably due to vectorization. There might still be some low-hanging fruit to make Finch better here, but I'm not sure. In experiments where the matrices are just a bit larger, this approach leads to issues for SciPy and Numba. See below with 4x larger dimensions and a 100x sparser C.

[plot: chain_mul_plot2]

3. Mixing the storage order of the dense arrays ('f' vs 'c') seems to slow it down. It was fastest for me when both were set to 'c'.

4. There was an issue with the tag parameter being passed in the compiled decorator. I think this led to both methods using Galley's plan. When I fixed the tag, the default scheduler became significantly worse, as it tried to do the whole thing in one kernel.

[plot: chain_mul_plot]
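
Regarding point 1, here's a minimal sketch of the single-threaded setup; the variables have to be set before NumPy and Numba are imported, otherwise the thread counts are already fixed:

import os

# Pin BLAS and Numba to a single thread for a fair comparison with Finch.
# These must be set before the libraries are imported.
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["NUMBA_NUM_THREADS"] = "1"

import numpy as np  # OpenBLAS reads OPENBLAS_NUM_THREADS when it initializes here
import numba        # Numba picks up NUMBA_NUM_THREADS at import time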

@willow-ahrens
Collaborator

This could be solved with a foreign function interface, where Finch would be allowed to send kernels or subkernels to library implementations.

@willow-ahrens
Collaborator

One way to make this a more compelling plot in the short term would be to use 10% sparse matrices for A and B. This would allow us to compete on a more even playing field.
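
Concretely, that would mean generating all three operands sparse, something like (a sketch, reusing the shapes from the snippets above):

# Sketch: make A and B 10% sparse as well, so every operand can use a sparse format.
a = sparse.random((LEN, LEN - 5), density=0.1)
b = sparse.random((LEN - 5, LEN), density=0.1)
s = sparse.random((LEN, LEN), density=DENSITY)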

@kylebd99
Collaborator

Given this, are we comfortable closing this issue?

@willow-ahrens
Collaborator

I don't think this is pressing. We should add an FFI.
