
Performance of Matrix Chain Multiplication in Python #669

Closed
mtsokol opened this issue Dec 11, 2024 · 19 comments

@mtsokol
Member

mtsokol commented Dec 11, 2024

Hi @willow-ahrens @kylebd99,

I wanted to discuss the Matrix Chain Multiplication Python example that I'm working on.
The Python implementation relies on lazy indexing, as tensordot is slow for other examples, like SDDMM.
Here's the Python implementation:

@sparse.compiled(opt=sparse.DefaultScheduler())
def matrix_mul_finch(a, b, s):
    return sparse.sum(a[:, None, :, None] * sparse.permute_dims(b, (1, 0))[None, :, :, None] * s[None, :, None, :], axis=(-2, -3))

But it looks like for chain multiplication it's also way slower compared to SciPy or Numba.
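
For reference, here's roughly the SciPy baseline it's compared against (a sketch, not the exact benchmark code; it assumes s is available as a SciPy CSR array):

import scipy.sparse as sps

def matrix_mul_scipy(a, b, s_csr):
    # Chain product a @ b @ s: dense @ dense first, then dense @ sparse.
    return (a @ b) @ s_csr

# e.g. s_csr = sps.random(LEN, LEN, density=DENSITY, format="csr")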
Here's a Julia snippet that reproduces it, give or take:

using Finch

LEN = 250;
a_raw = rand(LEN, LEN - 5);
b_raw = rand(LEN, LEN - 5);
c_raw = fsprand(Float64, LEN, LEN, 0.01);

a = lazy(swizzle(Tensor(a_raw), 1, 2));
b = lazy(swizzle(Tensor(b_raw), 1, 2));
c = lazy(swizzle(c_raw, 1, 2));

plan = sum(broadcast(*, broadcast(*, a[:, nothing, :, nothing], b[nothing, :, :, nothing]), c[nothing, :, nothing, :]), dims=(2, 3));

result = compute(plan);

@assert sum(broadcast(abs, Array(result) - (a_raw * transpose(b_raw) * Array(c_raw)))) < 1e-5

It looks like in each case the output is densified instead of staying sparse.

With verbose=true and the default scheduler, I get these logs:

:(function var"##compute#293"(prgm)
      begin
          V = ((((((((((((((((((((prgm.children[1]).children[2]).children[2]).children[3]).children[1]).children[2]).children[1]).children[2]).children[1]).children[1]).children[2]).children[1]).children[2]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).tns.val::Tensor{DenseLevel{Int64, DenseLevel{Int64, ElementLevel{0.0, Float64, Int64, Vector{Float64}}}}}
          V_2 = ((((((((((((((((((((prgm.children[1]).children[2]).children[2]).children[3]).children[1]).children[2]).children[1]).children[2]).children[1]).children[1]).children[2]).children[1]).children[3]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).tns.val::Tensor{DenseLevel{Int64, DenseLevel{Int64, ElementLevel{0.0, Float64, Int64, Vector{Float64}}}}}
          V_3 = (((((((((((((((prgm.children[1]).children[2]).children[2]).children[3]).children[1]).children[2]).children[1]).children[3]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).tns.val::Tensor{SparseCOOLevel{2, Tuple{Int64, Int64}, Vector{Int64}, Tuple{Vector{Int64}, Vector{Int64}}, ElementLevel{0.0, Float64, Int64, Vector{Float64}}}}
          A0 = V::Tensor{DenseLevel{Int64, DenseLevel{Int64, ElementLevel{0.0, Float64, Int64, Vector{Float64}}}}}
          A1 = V_2::Tensor{DenseLevel{Int64, DenseLevel{Int64, ElementLevel{0.0, Float64, Int64, Vector{Float64}}}}}
          A2 = V_3::Tensor{SparseCOOLevel{2, Tuple{Int64, Int64}, Vector{Int64}, Tuple{Vector{Int64}, Vector{Int64}}, ElementLevel{0.0, Float64, Int64, Vector{Float64}}}}
          A3 = Tensor(Dense(SparseDict(Element{0.0, Float64}())))::Tensor{DenseLevel{Int64, SparseDictLevel{Int64, Vector{Int64}, Vector{Int64}, Vector{Int64}, Dict{Tuple{Int64, Int64}, Int64}, Vector{Int64}, ElementLevel{0.0, Float64, Int64, Vector{Float64}}}}}
          @finch mode = :fast begin
                  A3 .= 0.0
                  for i1 = _
                      for i0 = _
                          for i3 = _
                              for i2 = _
                                  A3[i3, i0] << + >>= (*)((*)(A0[i0, i1], A1[i2, i1]), A2[i2, i3])
                              end
                          end
                      end
                  end
                  return A3
              end
          return (swizzle(A3, 2, 1),)
      end
  end)

With verbose=true and the Galley scheduler, I get these logs:

Input Queries :
Query(Alias(##A#400),
   Materialize(Value(t_undef),Value(t_undef),##i#275,##i#278,
      Aggregate(Value(+),Value(0.0),##i#276,##i#277,
         MapJoin(Value(*),
            MapJoin(Value(*),Input(Value(FIBER),##i#275,##i#277),Input(Value(FIBER),##i#276,##i#277)),Input(Value(FIBER),##i#276,##i#278)))))
FAQ Opt Time: 0.0069370269775390625
--------------- Logical Plan ---------------
Query(Alias(A_37),
   Aggregate(Value(+),Value(0.0),##i#276,
      MapJoin(Value(*),Input(Value(FIBER),##i#276,##i#277),Input(Value(FIBER),##i#276,##i#278))))
Query(##A#400,
   Materialize(Value(t_undef),Value(t_undef),##i#275,##i#278,
      Aggregate(Value(+),Value(0.0),##i#277,
         MapJoin(Value(*),Input(Value(FIBER),##i#275,##i#277),Alias(A_37)))))
--------------------------------------------
Physical Opt Time: 0.0069370269775390625
--------------- Physical Plan ---------------
Query(Alias(A_39,##i#276,##i#277),
   Materialize(Value(t_dense),Value(t_dense),##i#277,##i#276,
      Aggregate(Value(Finch.FinchNotation.InitWriter{0.0}()),Value(0.0),Input(##i#276::t_default,##i#277::t_default))),##i#277,##i#276)
Query(Alias(A_37,##i#278,##i#277),
   Materialize(Value(t_dense),Value(t_dense),##i#278,##i#277,
      Aggregate(Value(+),Value(0.0),##i#276,
         MapJoin(Value(*),Alias(A_39,##i#277::t_default,##i#276::t_follow),Input(##i#276::t_default,##i#278::t_default)))),##i#278,##i#276,##i#277)
Query(##A#400,
   Materialize(Value(t_dense),Value(t_dense),##i#275,##i#278,
      Aggregate(Value(+),Value(0.0),##i#277,
         MapJoin(Value(*),Input(##i#275::t_default,##i#277::t_default),Alias(A_37,##i#278::t_default,##i#277::t_follow)))),##i#277,##i#278,##i#275)
--------------------------------------------
Expected Output Size: 61250
Expected Output Size: 57085
Expected Output Size: 58250
Executing:
:(function var"##compute#401"(prgm)
      begin
          V = ((((((((((((((((((((prgm.children[1]).children[2]).children[2]).children[3]).children[1]).children[2]).children[1]).children[2]).children[1]).children[1]).children[2]).children[1]).children[2]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).tns.val::Tensor{DenseLevel{Int64, DenseLevel{Int64, ElementLevel{0.0, Float64, Int64, Vector{Float64}}}}}
          V_2 = ((((((((((((((((((((prgm.children[1]).children[2]).children[2]).children[3]).children[1]).children[2]).children[1]).children[2]).children[1]).children[1]).children[2]).children[1]).children[3]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).tns.val::Tensor{DenseLevel{Int64, DenseLevel{Int64, ElementLevel{0.0, Float64, Int64, Vector{Float64}}}}}
          V_3 = (((((((((((((((prgm.children[1]).children[2]).children[2]).children[3]).children[1]).children[2]).children[1]).children[3]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).tns.val::Tensor{SparseCOOLevel{2, Tuple{Int64, Int64}, Vector{Int64}, Tuple{Vector{Int64}, Vector{Int64}}, ElementLevel{0.0, Float64, Int64, Vector{Float64}}}}
          _A_1 = Tensor(Dense{Int32}(Dense{Int32}(Element{0.0, Float64, Int64}(Float64[]), 250), 250))
          A_37 = Tensor(Dense{Int32}(Dense{Int32}(Element{0.0, Float64, Int64}(Float64[]), 250), 245))
          A_39 = Tensor(Dense{Int32}(Dense{Int32}(Element{0.0, Float64, Int64}(Float64[]), 245), 250))
          ()
          @finch mode = :safe begin
                  A_39 .= 0.0
                  for _i_2 = _
                      for _i_3 = _
                          A_39[_i_2, _i_3] << Finch.FinchNotation.InitWriter{0.0}() >>= V_2[_i_3, _i_2]
                      end
                  end
              end
          @finch mode = :safe begin
                  A_37 .= 0.0
                  for _i_4 = _
                      for _i_3 = _
                          for _i_2 = _
                              A_37[_i_4, _i_2] << + >>= (*)(V_3[_i_3, _i_4], A_39[_i_2, (follow)(_i_3)])
                          end
                      end
                  end
              end
          @finch mode = :safe begin
                  _A_1 .= 0.0
                  for _i_2 = _
                      for _i_4 = _
                          for _i_5 = _
                              _A_1[_i_5, _i_4] << + >>= (*)(A_37[_i_4, (follow)(_i_2)], V[_i_5, _i_2])
                          end
                      end
                  end
              end
          return Tuple([_A_1])
      end
  end)
┌ Warning: Performance Warning: non-concordant traversal of A_39[_i_2, _i_3] (hint: most arrays prefer column major or first index fast, run in fast mode to ignore this warning)
└ @ Finch ?:0
┌ Warning: Performance Warning: non-concordant traversal of A_37[_i_4, _i_6] (hint: most arrays prefer column major or first index fast, run in fast mode to ignore this warning)
└ @ Finch ?:0
@willow-ahrens
Collaborator

Important note: using broadcast and .* is redundant. You can broadcast(*, A, B) or you can A .* B, but broadcast(.*, A, B) says to broadcast the broadcast.

@mtsokol
Member Author

mtsokol commented Dec 11, 2024

Important note: using broadcast and .* is redundant. You can broadcast(*, A, B) or you can A .* B, but broadcast(.*, A, B) says to broadcast the broadcast.

Ah right, I updated the snippet and logs to use broadcast(*, A, B). The performance question still applies, as from Python we always call broadcast(*, ...).

@willow-ahrens
Collaborator

One note on using Galley: it optimizes for the first inputs you give it. Are you testing it with multiple inputs?

@mtsokol
Member Author

mtsokol commented Dec 11, 2024

I'm benchmarking it for one set of input tensors per configuration. I have a parametrization grid, e.g.:

configs = [
    {"LEN": 5000, "DENSITY": 0.00001},
    {"LEN": 10000, "DENSITY": 0.00001},
    {"LEN": 20000, "DENSITY": 0.00001},
    {"LEN": 25000, "DENSITY": 0.00001},
    {"LEN": 30000, "DENSITY": 0.00001},
]

and for each of them I construct input tensors, create the benchmarked functions (Finch, Galley, etc.), precompile them, and benchmark them.
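
Roughly like this, just a sketch of the structure (names and the timing harness here are placeholders, not the actual benchmark code):

import time

import numpy as np
import sparse

def run_config(LEN, DENSITY, repeat=5):
    # One set of inputs per configuration; conversion details are simplified here.
    a = sparse.asarray(np.random.rand(LEN, LEN - 5))
    b = sparse.asarray(np.random.rand(LEN - 5, LEN))
    s = sparse.random((LEN, LEN), density=DENSITY)

    matrix_mul_finch(a, b, s)  # warm-up call doubles as precompilation
    start = time.perf_counter()
    for _ in range(repeat):
        matrix_mul_finch(a, b, s)
    return (time.perf_counter() - start) / repeat

results = {cfg["LEN"]: run_config(cfg["LEN"], cfg["DENSITY"]) for cfg in configs}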

@willow-ahrens
Collaborator

I see. Could you try giving Galley a new tag for each of them? I think you can set the tag parameter as a kwarg.
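
Something like this from the Python side, maybe (just a sketch; I'm assuming the Galley scheduler is exposed as sparse.GalleyScheduler and that the compiled decorator forwards a tag keyword):

# Hypothetical: one tag per configuration so Galley plans for each input size separately.
@sparse.compiled(opt=sparse.GalleyScheduler(), tag=f"chain_mul_len_{LEN}")
def matrix_mul_galley(a, b, s):
    return sparse.sum(
        a[:, None, :, None]
        * sparse.permute_dims(b, (1, 0))[None, :, :, None]
        * s[None, :, None, :],
        axis=(-2, -3),
    )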

@mtsokol
Member Author

mtsokol commented Dec 11, 2024

Sure, I also tried with the tag set; it's pretty much the same. Here's the plot I'm getting for the Python example execution:

[plot: benchmark results for the Python example]

And verbose mode from the Python layer shows that in both cases (default and Galley scheduler) the output is densified:

:(function var"##compute#538"(prgm)
      begin
          V = ((((((((((((((((((((((((((((prgm.children[1]).children[2]).children[2]).children[3]).children[1]).children[1]).children[1]).children[2]).children[1]).children[2]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).children[1]).children[2]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).tns.val::Tensor{DenseLevel{Int64, DenseLevel{Int64, ElementLevel{0.0, Float64, Int64, PyArray{Float64, 1, true, true, Float64}}}}}
          V_2 = ((((((((((((((((((((((((((((((prgm.children[1]).children[2]).children[2]).children[3]).children[1]).children[1]).children[1]).children[2]).children[1]).children[2]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).children[1]).children[3]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).tns.val::Tensor{DenseLevel{Int64, DenseLevel{Int64, ElementLevel{0.0, Float64, Int64, PyArray{Float64, 1, true, true, Float64}}}}}
          V_3 = (((((((((((((((((((prgm.children[1]).children[2]).children[2]).children[3]).children[1]).children[1]).children[1]).children[2]).children[1]).children[3]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).tns.val::Tensor{DenseLevel{Int64, SparseListLevel{Int64, PlusOneVector{Int32, PyArray{Int32, 1, true, true, Int32}}, PlusOneVector{Int32, PyArray{Int32, 1, true, true, Int32}}, ElementLevel{0.0, Float64, Int64, PyArray{Float64, 1, true, true, Float64}}}}}
          A0 = V::Tensor{DenseLevel{Int64, DenseLevel{Int64, ElementLevel{0.0, Float64, Int64, PyArray{Float64, 1, true, true, Float64}}}}}
          A1 = V_2::Tensor{DenseLevel{Int64, DenseLevel{Int64, ElementLevel{0.0, Float64, Int64, PyArray{Float64, 1, true, true, Float64}}}}}
          A2 = V_3::Tensor{DenseLevel{Int64, SparseListLevel{Int64, PlusOneVector{Int32, PyArray{Int32, 1, true, true, Int32}}, PlusOneVector{Int32, PyArray{Int32, 1, true, true, Int32}}, ElementLevel{0.0, Float64, Int64, PyArray{Float64, 1, true, true, Float64}}}}}
          A3 = Tensor(Dense(Dense(Element{0.0, Float64}())))::Tensor{DenseLevel{Int64, DenseLevel{Int64, ElementLevel{0.0, Float64, Int64, Vector{Float64}}}}}
          @finch mode = :fast begin
                  A3 .= 0.0
                  for i3 = _
                      for i1 = _
                          for i2 = _
                              for i0 = _
                                  A3[i0, i3] << + >>= (*)((*)(A0[i0, i1], A1[i2, i1]), A2[i2, i3])
                              end
                          end
                      end
                  end
                  return A3
              end
          return (A3,)
      end
  end)

@willow-ahrens
Collaborator

Could you paste the benchmark code somewhere?

@mtsokol
Member Author

mtsokol commented Dec 11, 2024

Here's a self-contained script: https://gist.github.com/mtsokol/5718e1e57b101b93c2b83d83112f56c0
It saves a plot.

@kylebd99
Collaborator

Thanks for the script. This is super helpful. I'm taking a look now.

@willow-ahrens
Collaborator

For what it's worth, on my machine, from the Julia interface, I can get Galley to use a sparse format:

julia> LEN = 500;

julia> a_raw = rand(LEN, LEN - 5);

julia> b_raw = rand(LEN, LEN - 5);

julia> c_raw = fsprand(Float64, LEN, LEN, 0.0001);

julia> a = lazy(swizzle(Tensor(a_raw), 1, 2));

julia> b = lazy(swizzle(Tensor(b_raw), 1, 2));

julia> c = lazy(swizzle(c_raw, 1, 2));

julia> plan = sum(broadcast(*, broadcast(*, a[:, nothing, :, nothing], b[nothing, :, :, nothing]), c[nothing, :, nothing, :]), dims=(2, 3));

julia> result = compute(plan, verbose=true, ctx=galley_scheduler());
Executing:
:(function var"##compute#362"(prgm)
      begin
          V = ((((((((((((((((((((prgm.children[1]).children[2]).children[2]).children[3]).children[1]).children[2]).children[1]).children[2]).children[1]).children[1]).children[2]).children[1]).children[2]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).tns.val::Tensor{DenseLevel{Int64, DenseLevel{Int64, ElementLevel{0.0, Float64, Int64, Vector{Float64}}}}}
          V_2 = ((((((((((((((((((((prgm.children[1]).children[2]).children[2]).children[3]).children[1]).children[2]).children[1]).children[2]).children[1]).children[1]).children[2]).children[1]).children[3]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).tns.val::Tensor{DenseLevel{Int64, DenseLevel{Int64, ElementLevel{0.0, Float64, Int64, Vector{Float64}}}}}
          V_3 = (((((((((((((((prgm.children[1]).children[2]).children[2]).children[3]).children[1]).children[2]).children[1]).children[3]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).tns.val::Tensor{SparseCOOLevel{2, Tuple{Int64, Int64}, Vector{Int64}, Tuple{Vector{Int64}, Vector{Int64}}, ElementLevel{0.0, Float64, Int64, Vector{Float64}}}}
          _A_1 = Tensor(Dense{Int32}(Dense{Int32}(Element{0.0, Float64, Int64}([50.909509953661015, 51.55506500817316, 49.06051924835156, 49.11894756080337, 49.87697390527504, 51.484390325465256, 52.922481331798764, 47.14373046494763, 50.98300988790281, 48.20091698143638  …  94.18366400484176, 89.98885364709577, 94.91692055697541, 100.8374202778461, 89.78504712431399, 90.54363416225185, 86.77978806148055, 88.02084781211785, 89.35515622101124, 85.80792030912862]), 250), 250))
          A_6 = Tensor(Dense{Int32}(Dense{Int32}(Element{0.0, Float64, Int64}([0.4546635288293994, 0.2660143798936316, 1.9663527207827898, 2.061613267816494, 0.06157434998014895, 0.6340272833405703, 0.04483174111819589, 0.9572180888714056, 1.209130133600966, 0.011503456800502358  …  0.01104175659091753, 0.43101164301046946, 0.0, 0.7050757022128256, 0.526619375142649, 1.4194965342488919, 0.07713399632958723, 1.806008201240464, 0.8964672343053504, 0.3577360296716781]), 250), 245))
          A_8 = Tensor(Dense{Int32}(Dense{Int32}(Element{0.0, Float64, Int64}([0.6544506386562216, 0.31612465185721306, 0.7217552753658356, 0.6769946283675499, 0.4319074661143846, 0.9789452340361754, 0.12421530550352888, 0.5644116806267045, 0.40386088070597703, 0.08543785205858079  …  0.9105396373084738, 0.5685989532103138, 0.07443162741339726, 0.14851716325285946, 0.7261718834208045, 0.39879687533605557, 0.3295145198985465, 0.05716585178202094, 0.07699274243448839, 0.8397302478662261]), 245), 250))
          ()
          @finch mode = :fast begin
                  A_8 .= 0.0
                  for _i_2 = _
                      for _i_3 = _
                          A_8[_i_2, _i_3] << Finch.FinchNotation.InitWriter{0.0}() >>= V_2[_i_3, _i_2]
                      end
                  end
              end
          @finch mode = :fast begin
                  A_6 .= 0.0
                  for _i_4 = _
                      for _i_3 = _
                          for _i_2 = _
                              A_6[_i_4, _i_2] << + >>= (*)(V_3[_i_3, _i_4], A_8[_i_2, (follow)(_i_3)])
                          end
                      end
                  end
              end
          @finch mode = :fast begin
                  _A_1 .= 0.0
                  for _i_2 = _
                      for _i_4 = _
                          for _i_5 = _
                              _A_1[_i_5, _i_4] << + >>= (*)(A_6[_i_4, (follow)(_i_2)], V[_i_5, _i_2])
                          end
                      end
                  end
              end
          return Tuple([_A_1])
      end
  end)

julia> result = compute(plan, verbose=true, ctx=galley_scheduler(), tag=:new);
Executing:
:(function var"##compute#447"(prgm)
      begin
          V = ((((((((((((((((((((prgm.children[1]).children[2]).children[2]).children[3]).children[1]).children[2]).children[1]).children[2]).children[1]).children[1]).children[2]).children[1]).children[2]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).tns.val::Tensor{DenseLevel{Int64, DenseLevel{Int64, ElementLevel{0.0, Float64, Int64, Vector{Float64}}}}}
          V_2 = ((((((((((((((((((((prgm.children[1]).children[2]).children[2]).children[3]).children[1]).children[2]).children[1]).children[2]).children[1]).children[1]).children[2]).children[1]).children[3]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).tns.val::Tensor{DenseLevel{Int64, DenseLevel{Int64, ElementLevel{0.0, Float64, Int64, Vector{Float64}}}}}
          V_3 = (((((((((((((((prgm.children[1]).children[2]).children[2]).children[3]).children[1]).children[2]).children[1]).children[3]).children[1]).children[1]).children[1]).children[1]).children[1]).children[1]).children[2]).tns.val::Tensor{SparseCOOLevel{2, Tuple{Int64, Int64}, Vector{Int64}, Tuple{Vector{Int64}, Vector{Int64}}, ElementLevel{0.0, Float64, Int64, Vector{Float64}}}}
          _A_1 = Tensor(Dense{Int32}(Dense{Int32}(Element{0.0, Float64, Int64}(Float64[]), 500), 500))
          A_14 = Tensor(Dense{Int32}(SparseByteMap{Int32}(Element{0.0, Float64, Int64}(Float64[]), 500, [1], Bool[], Tuple{Int64, Int32}[]), 495))
          A_16 = Tensor(Dense{Int32}(Dense{Int32}(Element{0.0, Float64, Int64}(Float64[]), 495), 500))
          ()
          @finch mode = :fast begin
                  A_16 .= 0.0
                  for _i_2 = _
                      for _i_3 = _
                          A_16[_i_2, _i_3] << Finch.FinchNotation.InitWriter{0.0}() >>= V_2[_i_3, _i_2]
                      end
                  end
              end
          @finch mode = :fast begin
                  A_14 .= 0.0
                  for _i_4 = _
                      for _i_3 = _
                          for _i_2 = _
                              A_14[_i_4, _i_2] << + >>= (*)(V_3[_i_3, _i_4], A_16[_i_2, (follow)(_i_3)])
                          end
                      end
                  end
              end
          @finch mode = :fast begin
                  _A_1 .= 0.0
                  for _i_2 = _
                      for _i_4 = _
                          for _i_5 = _
                              _A_1[_i_5, _i_4] << + >>= (*)(V[_i_5, _i_2], A_14[(walk)(_i_4), (follow)(_i_2)])
                          end
                      end
                  end
              end
          return Tuple([_A_1])
      end
  end)

@mtsokol
Member Author

mtsokol commented Dec 11, 2024

Another thing to discuss tomorrow: we can revisit tensordot and see if it makes sense to use it instead of lazy indexing (lazy indexing is rather temporary; we would like to be able to call a @ b and get decent performance).
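
For reference, the tensordot formulation of the same chain would look roughly like this (a sketch; whether the schedulers handle it well is exactly the question):

def matrix_mul_tensordot(a, b, s):
    # Two pairwise contractions instead of lazy indexing: (a @ b) @ s.
    ab = sparse.tensordot(a, b, axes=([1], [0]))
    return sparse.tensordot(ab, s, axes=([1], [0]))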

@willow-ahrens
Collaborator

They should be more similar now. Both the default Julia scheduler and the Galley scheduler have been adjusted to work better with tensordot.

@kylebd99
Collaborator

I'm a bit lost, what did you change?

@willow-ahrens
Collaborator

It used to be that we needed to do lazy indexing in order to fuse SDDMM in the default scheduler. Now that I have implemented backwards fusion, this is no longer necessary. It was never necessary with Galley.

@kylebd99
Collaborator

Okay, I spent maybe too much time thinking about this today. But I have a few findings.

1. Unless you set the following vars, both SciPy and Numba will run in parallel, giving them a significant boost (see the sketch after this list):

os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["NUMBA_NUM_THREADS"] = "1"

2. At the settings used, the dense matmul is very fast, so what we're seeing is primarily a difference in the speed of the dense computation. I think that Numba and SciPy are just a lot better at this, probably due to vectorization. There might still be some low-hanging fruit to make Finch better here, but I'm not sure. In experiments where the matrices are just a bit larger, this approach leads to issues for SciPy and Numba. See below with 4x larger dimensions and a 100x sparser C.

[plot: chain_mul_plot2]

3. Mixing the storage order of the dense arrays ('f' vs 'c') seems to slow it down. It was fastest for me when both were set to 'c'.

4. There was an issue with the tag parameter being passed in the compiled decorator. I think this led to both methods using Galley's plan. When I fixed the tag, the default scheduler became significantly worse, as it tried to do the whole thing in one kernel.

[plot: chain_mul_plot]
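
Regarding point 1, here's a minimal sketch of the single-threaded setup; the variables have to be set before NumPy and Numba are imported, otherwise the thread counts are already fixed:

import os

# Pin BLAS and Numba to a single thread for a fair comparison with Finch.
# These must be set before the libraries are imported.
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["NUMBA_NUM_THREADS"] = "1"

import numpy as np  # OpenBLAS reads OPENBLAS_NUM_THREADS when it initializes here
import numba        # Numba picks up NUMBA_NUM_THREADS at import time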

@willow-ahrens
Collaborator

This could be solved with a foreign function interface, where Finch would be allowed to send kernels or subkernels to library implementations.

@willow-ahrens
Collaborator

One way to make this a more compelling plot in the short term would be to use 10% sparse matrices for A and B. This would allow us to compete on a more even playing field.
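
Concretely, that would mean generating all three operands sparse, something like (a sketch, reusing the shapes from the snippets above):

# Sketch: make A and B 10% sparse as well, so every operand can use a sparse format.
a = sparse.random((LEN, LEN - 5), density=0.1)
b = sparse.random((LEN - 5, LEN), density=0.1)
s = sparse.random((LEN, LEN), density=DENSITY)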

@kylebd99
Collaborator

Given this, are we comfortable closing this issue?

@willow-ahrens
Collaborator

I don't think this is pressing. We should add an FFI.
