mul! performance regression on master

On the current master branch, `mul!` of small matrices can be 10x slower than 1.3 (and also allocates)
```
julia> versioninfo()
Julia Version 1.4.0-DEV.556
Commit 4800158ef5* (2019-12-03 21:18 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin19.0.0)
  CPU: Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, skylake)
```

test code:

```
using LinearAlgebra
using BenchmarkTools

ndim = 3

m1 = rand(ComplexF64,ndim,ndim);
m2 = rand(ComplexF64,ndim,ndim);
ou = rand(ComplexF64,ndim,ndim);

@btime mul!($ou, $m1, $m2);
```

With the release-1.3 version,
```
33.471 ns (0 allocations: 0 bytes)
```
on master:
```
394.428 ns (1 allocation: 16 bytes)
```

While it's nano-seconds, when one has `mul!` in the inner-most/hot loop, this can easily translate to big performance degradation when most of the calculation is with such matrix productions. Larger matrices (e.g. `ndim = 30`) appears unaffected (dispatched to BLAS?) `2.732 μs (0 allocations: 0 bytes)` on 1.3 and `2.774 μs (0 allocations: 0 bytes)` on master.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

mul! performance regression on master #684

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

mul! performance regression on master #684

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions