Performance regression in RxInfer code on current master #50704

Closed
bvdmitri opened this issue Jul 28, 2023 · 23 comments
Labels: performance (Must go faster), regression (Regression in behavior compared to a previous version)

Comments

@bvdmitri
Contributor

bvdmitri commented Jul 28, 2023

We have a set of small benchmarks to quickly test our code in RxInfer. The aim of the package is to run efficient Bayesian inference, potentially on low-power, low-memory devices like a Raspberry Pi. We just noticed that on Julia 1.10 we have quite a noticeable GC regression. Consider this notebook. It is not an MWE, but it computes Bayesian posteriors in a simple linear Gaussian state-space probabilistic model. There are two settings:

  • Filtering: for each time step $t$, use observations up to time step $t$.
  • Smoothing: for each time step $t$, use observations up to a later time step $T > t$.

Here are the results on the current Julia release:

julia> versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)

julia> @benchmark run_filtering($datastream, $n, $v)
BenchmarkTools.Trial: 1504 samples with 1 evaluation.
 Range (min  max):  2.633 ms  13.932 ms  ┊ GC (min  max): 0.00%  69.28%
 Time  (median):     3.073 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.319 ms ±  1.058 ms  ┊ GC (mean ± σ):  7.08% ± 13.05%

   ▅▇▇██▇▅▃▂                                                 ▁
  ██████████▇▇▅▇█▇▇▅▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▇██▇▆█▇▆▄▄▅ █
  2.63 ms      Histogram: log(frequency) by time     7.92 ms <

 Memory estimate: 2.35 MiB, allocs estimate: 63823.

julia> @benchmark run_smoothing($data, $n, $v)
BenchmarkTools.Trial: 288 samples with 1 evaluation.
 Range (min  max):  13.868 ms  29.987 ms  ┊ GC (min  max):  0.00%  35.63%
 Time  (median):     15.545 ms              ┊ GC (median):     0.00%
 Time  (mean ± σ):   17.411 ms ±  3.975 ms  ┊ GC (mean ± σ):  10.81% ± 14.33%

   ▄▃█▁▄▅▁
  ▇███████▇▆▅▅▃▃▄▂▄▃▂▁▃▃▁▁▁▁▁▁▁▁▁▂▃▅▃▃▅▅▄▃▂▃▃▃▃▃▂▂▁▄▂▁▃▁▁▂▂▂▂ ▃
  13.9 ms         Histogram: frequency by time        28.4 ms <

 Memory estimate: 10.05 MiB, allocs estimate: 220417.

Here are the results on 1.10-beta1:

julia> versioninfo()
Julia Version 1.10.0-beta1
Commit 6616549950e (2023-07-25 17:43 UTC)

julia> @benchmark run_filtering($datastream, $n, $v)
BenchmarkTools.Trial: 1308 samples with 1 evaluation.
 Range (min  max):  3.260 ms  78.207 ms  ┊ GC (min  max): 0.00%  94.71%
 Time  (median):     3.479 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.818 ms ±  3.293 ms  ┊ GC (mean ± σ):  6.64% ±  7.41%

      ▄▆██▅▁
  ▂▃▄▇██████▇▅▅▃▃▃▃▃▂▂▃▃▃▃▃▃▂▃▃▂▂▂▁▂▂▂▂▁▂▁▂▂▁▂▂▁▁▁▂▂▂▂▂▂▂▂▂▂ ▃
  3.26 ms        Histogram: frequency by time        4.94 ms <

 Memory estimate: 2.51 MiB, allocs estimate: 69824.

julia> @benchmark run_smoothing($data, $n, $v)
BenchmarkTools.Trial: 291 samples with 1 evaluation.
 Range (min  max):  15.160 ms  88.841 ms  ┊ GC (min  max): 0.00%  79.71%
 Time  (median):     15.757 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   17.336 ms ±  7.862 ms  ┊ GC (mean ± σ):  7.05% ± 11.57%

  █▅▁
  █████▇▄▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▅▄ ▅
  15.2 ms      Histogram: log(frequency) by time      57.9 ms <

 Memory estimate: 10.12 MiB, allocs estimate: 222915.

As you can see, in the case of run_filtering the maximum time jumped from 13 ms to 78 ms, and the GC max jumped from 69% to 94%. The situation is similar for run_smoothing: the maximum time jumped from 29 ms to 88 ms, and the GC max from 35% to 79%.

The inference procedure allocates a lot of intermediate "messages" in the form of distributions from the Distributions.jl package, but it does not use any sampling. Instead, it computes analytical solutions for the posteriors. These analytical solutions also rely on dynamic multiple dispatch in many places. Eliminating dynamic multiple dispatch is not really an option; it is just how the algorithm works, and it was quite efficient until now.
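To illustrate the kind of computation involved, here is a minimal sketch of a standard conjugate Gaussian update (textbook math, not RxInfer's actual internals): a Gaussian prior combined with a Gaussian likelihood gives a closed-form Gaussian posterior, with no sampling.

using Distributions

# Precision-weighted update: the posterior precision is the sum of the prior
# and likelihood precisions; the posterior mean is the precision-weighted
# average of the prior mean and the observation.
function gaussian_posterior(prior::Normal, y::Real, obs_var::Real)
    w_prior, w_lik = 1 / var(prior), 1 / obs_var
    v = 1 / (w_prior + w_lik)
    return Normal(v * (w_prior * mean(prior) + w_lik * y), sqrt(v))
end

gaussian_posterior(Normal(0.0, 100.0), 1.2, 100.0)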

The major difference between the two functions is that run_filtering allocates a lot of information (messages) that is not used afterwards and can probably be freed right away, while run_smoothing retains this information until the end of the procedure. You can also see that the minimum execution time is worse in both cases.

I think this is quite a severe regression, especially for the filtering case, which is supposed to run real-time Bayesian inference with as few GC pauses as possible. We can of course refine our code base, but in the meantime, can this be improved in general? What could cause it? How should we proceed to debug it, and how can we help figure it out further?

julia> versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin22.4.0)
  CPU: 12 × Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
  Threads: 1 on 12 virtual cores

@KristofferC
Member

KristofferC commented Jul 28, 2023

Thanks for the report and the nice reproducers. The GC heuristics were significantly changed in #50144, so it isn't too surprising that there are cases where they need further refinement.

Putting this on the milestone for now so it isn't forgotten at least.

@KristofferC added the performance (Must go faster), regression (Regression in behavior compared to a previous version), and GC (Garbage collector) labels Jul 28, 2023
@KristofferC modified the milestone: 1.10 Jul 28, 2023
@bvdmitri
Contributor Author

Thanks, glad to help!

Since the reproducers are not an MWE, to give you an idea of what might be happening: I think one potential problem here is that the program (as part of the algorithm) allocates a LOT of small, short-lived objects. The algorithm, at a high level, involves creating a sparse graph and passing messages between the nodes of the graph. These messages are needed only for a short period of time. In the benchmark example the messages are encoded as Normal distributions, each consisting of two Float64 numbers. These messages are heap-allocated and go through dynamic multiple dispatch, because the algorithm does not (and cannot) know the types of the messages in advance.
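A hedged illustration of why this pattern is costly (a toy example, not RxInfer code): Normal{Float64} is a small isbits struct, but once the messages live behind an abstract type, every message is boxed on the heap and every operation on it goes through dynamic dispatch.

using Distributions

# Concrete element type: the structs are stored inline, no per-element boxing.
msgs_concrete = [Normal(0.0, 1.0) for _ in 1:10_000]

# Abstract element type: one heap allocation (box) per message.
msgs_abstract = Distribution[Normal(0.0, 1.0) for _ in 1:10_000]

# Each call on an element of msgs_abstract is resolved via dynamic dispatch.
sum(mean, msgs_abstract)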

As a side note, I could not run the benchmark on master because some of the downstream packages fail to compile there.

@gbaraldi
Member

gbaraldi commented Jul 28, 2023

So I ran your code (1.10 + #50682) and I see a wall-time regression, but not necessarily a GC one, because max GC time might not be the best way to quantify it: if out of 100 runs only one triggers a collection, but it's a big collection, we might see 90% max GC usage while the median/mean stay low. The more concerning thing is that there seems to be an overall regression in the performance of the code, and I'm not sure whether it's just GC related.

1.9

julia> @benchmark run_filtering(datastream, n, $v)
BenchmarkTools.Trial: 2421 samples with 1 evaluation.
 Range (min  max):  1.847 ms    5.287 ms  ┊ GC (min  max): 0.00%  60.24%
 Time  (median):     1.906 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.064 ms ± 612.807 μs  ┊ GC (mean ± σ):  6.81% ± 12.77%

  ▃█
  ██▅▃▃▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▃▂ ▂
  1.85 ms         Histogram: frequency by time        4.66 ms <

 Memory estimate: 2.35 MiB, allocs estimate: 63824.

julia> @benchmark run_smoothing(data, n, $v)
BenchmarkTools.Trial: 476 samples with 1 evaluation.
 Range (min  max):   8.658 ms  28.003 ms  ┊ GC (min  max):  0.00%  29.64%
 Time  (median):      9.164 ms              ┊ GC (median):     0.00%
 Time  (mean ± σ):   10.508 ms ±  2.647 ms  ┊ GC (mean ± σ):  10.46% ± 13.81%

  ▆█▅▄▄▂▁                       ▁▁
  ███████▇███▇▆▄▅▁▅▁▄▁▁▄▄▄▄▇▇▇▆▇███▇█▇█▇█▆▇▇▅▅▆▄▅▄▅▁▁▁▁▁▁▄▁▄▄ ▇
  8.66 ms      Histogram: log(frequency) by time        18 ms <

 Memory estimate: 10.05 MiB, allocs estimate: 220418.

1.10 + that PR

julia> @benchmark run_filtering(datastream, n, $v)
BenchmarkTools.Trial: 1642 samples with 1 evaluation.
 Range (min  max):  2.746 ms  23.783 ms  ┊ GC (min  max): 0.00%  87.46%
 Time  (median):     2.908 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.044 ms ±  1.398 ms  ┊ GC (mean ± σ):  4.46% ±  8.07%

    ▃▆▅▁▆▂▄▄▃▅▄▇▆█▅▇▆▆▇██▅▆▂▅▂
  ▃▇███████████████████████████▇▇▇▆▅▄▃▃▃▂▂▃▂▂▁▂▁▂▂▁▁▁▁▁▁▁▁▁▂ ▅
  2.75 ms        Histogram: frequency by time         3.3 ms <

 Memory estimate: 2.51 MiB, allocs estimate: 69825.

julia> @benchmark run_smoothing(data, n, $v)
BenchmarkTools.Trial: 416 samples with 1 evaluation.
 Range (min  max):  10.883 ms  33.700 ms  ┊ GC (min  max): 0.00%  65.24%
 Time  (median):     11.343 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   12.041 ms ±  3.190 ms  ┊ GC (mean ± σ):  4.34% ± 10.21%

  █▇█
  ███▆▁▁▁▄▆▇▄▁▄▄▁▁▁▁▁▁▄▁▄▁▄▄▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▅▄▇▄ ▆
  10.9 ms      Histogram: log(frequency) by time        28 ms <

 Memory estimate: 10.12 MiB, allocs estimate: 222916.

So the mean GC time went slightly down. But notice how the minimum time went up, meaning that there is another regression somewhere, especially because on the minimum-time run the GC didn't run at all.

@bvdmitri
Contributor Author

@gbaraldi I agree with you that it could be some other sort of regression or a combination of regressions. I tend not to use mean statistics for comparison, specifically because they can be affected by background processes on the computer. I also do not immediately see how to compare the mean statistics in percentages (for GC) because the base (wall time) is different. It could be the case that the mean as a percentage is lower simply because the absolute wall time is bigger.

For example, in the filtering case the mean GC time is essentially the same, with only a small improvement if you account for the wall time:

julia> 2.064 * 0.0681   # 1.9: mean wall time (ms) × mean GC fraction
0.1405584

julia> 3.044 * 0.0446   # 1.10 + PR: mean wall time (ms) × mean GC fraction
0.1357624

But mean values in benchmarks are tricky anyway. I tend to compare the min/max values instead. I recall an article about this, which I read long ago and don't have a reference for, but the rough idea is:

  • min reflects the theoretical best performance of an algorithm, if everything goes perfectly
  • max reflects the worst-case scenario, if everything goes wrong
  • mean/median statistics are heavily affected by background tasks and cannot be reliably compared until you account for the skewness of the benchmark distributions (and I can see that the skewness differs considerably between the 1.9 and 1.10 benchmarks)

I don't want to say that comparing mean/median is always a bad idea. I just want to say that it is safer to compare min/max values in the majority of cases, unless you invest much more time in a proper analysis.
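For what it's worth, BenchmarkTools can compare two trials on a chosen estimator directly; a sketch, assuming t_19 and t_110 are the Trial objects collected on 1.9 and 1.10 (e.g. shared via BenchmarkTools.save and BenchmarkTools.load):

using BenchmarkTools

# Compare the minimum-time estimates of the two trials; a slowdown shows up
# as a TrialJudgement like "+25.00% => regression".
judge(minimum(t_110), minimum(t_19))

# The same comparison on the medians, for contrast.
judge(median(t_110), median(t_19))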

The max GC time is very consistent across different platforms and computers (I checked on my two computers several times, plus your benchmark). It does not look like an outlier to me. But you are also correct that GC may be only one part of the story.

I checked the benchmarks with GC disabled. These are my results:

GC.enable(false)   # disable collection so the timings below exclude GC entirely
benchmark = @benchmark run_filtering($datastream, $n, $v) samples=100
GC.enable(true)
GC.gc(true)        # run a full collection to reclaim the garbage accumulated above
benchmark
# 1.9
BenchmarkTools.Trial: 100 samples with 1 evaluation.
 Range (min  max):  1.829 ms    3.831 ms  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     2.064 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.106 ms ± 271.007 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▂▂▃ ▇▅█ ▃▅▂▄▂                                              
  ▅▆███▃███▅█████▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▃ ▃
  1.83 ms         Histogram: frequency by time        3.44 ms <

 Memory estimate: 2.35 MiB, allocs estimate: 63823.

# 1.10-beta1
BenchmarkTools.Trial: 100 samples with 1 evaluation.
 Range (min  max):  2.568 ms    4.568 ms  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     2.749 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.796 ms ± 319.726 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁▁  ▁▄█▆▃                                                    
  ███▅█████▆▆▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▃▁▁▁▃ ▃
  2.57 ms         Histogram: frequency by time        4.31 ms <

 Memory estimate: 2.51 MiB, allocs estimate: 69824.
GC.enable(false)
benchmark = @benchmark run_smoothing($data, $n, $v) samples=100
GC.enable(true)
GC.gc(true)
benchmark
# 1.9
BenchmarkTools.Trial: 100 samples with 1 evaluation.
 Range (min  max):  8.132 ms   11.818 ms  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     8.832 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   8.997 ms ± 758.615 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▂        █ ▂ ▃                                               
  █▅▇█▃▃▁▅▆███▆█▄▇▄▄▃▃▁▃▆▃▅▁▁▁▁▁▃▁▁▁▁▃▁▁▁▁▃▄▁▃▁▃▁▁▁▃▁▁▁▁▁▁▁▁▃ ▃
  8.13 ms         Histogram: frequency by time        11.8 ms <

 Memory estimate: 10.05 MiB, allocs estimate: 220417.

# 1.10-beta1
BenchmarkTools.Trial: 100 samples with 1 evaluation.
 Range (min  max):   9.918 ms   12.661 ms  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     10.534 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   10.644 ms ± 460.803 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

              ▄█▅▃▃                                             
  ▅▁▁▅▃▅▁▅▆▇▇██████▆▅█▇▆▆▃▁▅▁▁▁▁▁▁▁▃▁▁▁▃▁▃▃▃▁▁▅▁▃▁▁▁▁▁▁▁▁▁▁▁▁▃ ▃
  9.92 ms         Histogram: frequency by time         12.4 ms <

 Memory estimate: 10.12 MiB, allocs estimate: 222915.

So yes, there is some sort of regression, which appears to be further amplified by the GC timings: something triggers the GC to spend even more time in the worst-case scenario. But I'm not sure how to narrow it down further; the Profile output is very different between versions and basically unreadable.
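One way to narrow it down further (a sketch; GC.enable_logging exists since Julia 1.8) would be to log individual collections and check whether the worst case comes from a few long full collections or from many incremental ones:

GC.enable_logging(true)    # prints one line per collection with the pause time and amount collected
run_smoothing(data, n, v)  # the benchmark function from the notebook above
GC.enable_logging(false)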

@gbaraldi
Member

gbaraldi commented Aug 2, 2023

The GC might decide to coalesce garbage collections to bring down the overall time, at the cost of potentially longer pauses.

@nathanrboyer

This comment was marked as off-topic.

@gbaraldi

This comment was marked as off-topic.

@nathanrboyer

This comment was marked as off-topic.

@gbaraldi

This comment was marked as off-topic.

@gbaraldi

This comment was marked as off-topic.

@nathanrboyer

This comment was marked as off-topic.

@gbaraldi

This comment was marked as off-topic.

@vtjnash

This comment was marked as off-topic.

@gbaraldi

This comment was marked as off-topic.

@gbaraldi changed the title from "GC regression in 1.10-beta1" to "Performance regression in RxInfer code in 1.10-beta1" Aug 10, 2023
@gbaraldi removed the GC (Garbage collector) label Aug 10, 2023
@gbaraldi
Member

I hid those comments because they actually belong to #50859

@KristofferC
Member

Based on the discussion in #50704 (comment) I don't think this should really be on the milestone. It seems to be a fairly marginal performance regression, and there isn't any analysis yet of what has gotten slower. I'll remove it from the milestone for now; depending on the results of some extra profiling work, it could be added back.

(Also, the link to the repro no longer works)

@KristofferC removed this from the 1.10 milestone Aug 23, 2023
@bvdmitri
Contributor Author

OK, just in case, here is a permanent link to the repro.

@bvdmitri
Contributor Author

@KristofferC The performance regression is pretty hard to bisect: it is not very consistent between runs and depends heavily on the percentage of time spent in GC during the run. I did, however, bisect the increase in the number and total size of allocations to 51e3bc3, but I'm not sure what this commit did or how it could cause a regression in our case.
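For reference, the allocation counts can also be compared across commits without BenchmarkTools by diffing the cumulative GC counters around a single run; this is the same mechanism @time uses internally (a sketch, assuming run_smoothing and its inputs from the script above are defined):

stats_before = Base.gc_num()
run_smoothing(data, n, v)
diff = Base.GC_Diff(Base.gc_num(), stats_before)
println("bytes allocated: ", diff.allocd, ", pool allocs: ", diff.poolalloc)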

@bvdmitri
Contributor Author

@KristofferC
On the current master I get significantly worse results: there is a ~15x regression for the second benchmark. It doesn't look GC related, as it's pretty consistent (and it allocates less than on 1.10).

1.10

BenchmarkTools.Trial: 2694 samples with 1 evaluation.
 Range (min  max):  1.671 ms    3.392 ms  ┊ GC (min  max): 0.00%  46.05%
 Time  (median):     1.777 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.855 ms ± 323.736 μs  ┊ GC (mean ± σ):  4.24% ± 10.19%

  ▂▇██▇█▇▂                                          ▁ ▁       ▁
  ████████▄▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅███▇▇████ █
  1.67 ms      Histogram: log(frequency) by time      3.27 ms <

 Memory estimate: 2.46 MiB, allocs estimate: 67408.
BenchmarkTools.Trial: 582 samples with 1 evaluation.
 Range (min  max):  7.600 ms  12.065 ms  ┊ GC (min  max): 0.00%  32.69%
 Time  (median):     7.926 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   8.589 ms ±  1.282 ms  ┊ GC (mean ± σ):  7.81% ± 11.24%

    ▇█▇▅▂
  ▄███████▆▇▄▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆▅▅▆▆▇▆▇▇▇█▅▇█▇▅▇▇▆▇█▇▅█▆▆▅▄▆▇▇▇ ▇
  7.6 ms       Histogram: log(frequency) by time     11.9 ms <

 Memory estimate: 10.39 MiB, allocs estimate: 228968.

master

BenchmarkTools.Trial: 2277 samples with 1 evaluation.
 Range (min  max):  1.939 ms   10.002 ms  ┊ GC (min  max): 0.00%  79.32%
 Time  (median):     2.090 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.195 ms ± 748.385 μs  ┊ GC (mean ± σ):  5.23% ± 10.61%

  ▅█▆
  ████▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▅▅▆▇ █
  1.94 ms      Histogram: log(frequency) by time      6.91 ms <

 Memory estimate: 2.27 MiB, allocs estimate: 61943.
BenchmarkTools.Trial: 43 samples with 1 evaluation.
 Range (min  max):  115.886 ms  128.093 ms  ┊ GC (min  max): 0.00%  8.04%
 Time  (median):     117.313 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   117.835 ms ±   2.359 ms  ┊ GC (mean ± σ):  0.37% ± 1.58%

  ▂▅██▂   ▅▂
  █████▁▅█████▅▁▁▅▅▅▁▅▁▁▅▁▁▁▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▁▁▁▁▁▁▁▁▁▁▁▁▅ ▁
  116 ms           Histogram: frequency by time          128 ms <
 Memory estimate: 10.16 MiB, allocs estimate: 234019.
Platform Info:
  OS: macOS (arm64-apple-darwin22.4.0)
  CPU: 11 × Apple M3 Pro
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, apple-m1)
  Threads: 1 on 5 virtual cores

@bvdmitri changed the title from "Performance regression in RxInfer code in 1.10-beta1" to "Performance regression in RxInfer code on current master" Jan 22, 2024
@bvdmitri
Contributor Author

The script I used for testing is this:

Script
using RxInfer, BenchmarkTools, Random

seed = 123

rng = MersenneTwister(seed)

v = 100.0

n = 500

hidden     = collect(1:n)
data       = hidden + rand(rng, Normal(0.0, sqrt(v)), n);
datastream = from(data) |> map(NamedTuple{(:y, ), Tuple{Float64}}, (d) -> (y = d, ));


@model function filtering(; c::ConstVariable, v::ConstVariable)

    x_prior_mean = datavar(Float64)
    x_prior_var  = datavar(Float64)

    x_prior ~ Normal(mean = x_prior_mean, variance = x_prior_var)
    x_next ~ x_prior + c

    y = datavar(Float64)
    y ~ Normal(mean = x_next, variance = v)

end

function run_filtering(datastream, n, v)

    autoupdates = @autoupdates begin
        x_prior_mean, x_prior_var = mean_var(q(x_next))
    end

    return infer(
        model       = filtering(c = 1.0, v = v),
        datastream  = datastream,
        autoupdates = autoupdates,
        initmarginals = (x_next = NormalMeanVariance(0.0, 10000.0), ),
        keephistory = n,
        historyvars = (x_next = KeepLast(), ),
        free_energy = true,
        autostart = true,
        postprocess = UnpackMarginalPostprocess()
    )

end


result_filtering = run_filtering(datastream, n, v)

@model function smoothing(n; c::ConstVariable, v::ConstVariable)

    x_prior ~ Normal(mean = 0.0, variance = 10000.0)

    x = randomvar(n)
    y = datavar(Float64, n)

    x_prev = x_prior

    for i in 1:n
        x[i] ~ x_prev + c
        y[i] ~ Normal(mean = x[i], variance = v)
        x_prev = x[i]
    end

    return x, y
end

function run_smoothing(data, n, v)
    return infer(
        model = smoothing(n, c = 1.0, v = v),
        data  = (y = data, ),
        returnvars = KeepLast(),
        free_energy = true,
        postprocess = UnpackMarginalPostprocess()
    )
end

result_smoothing = run_smoothing(data, n, v)

benchmark1 = @benchmark run_filtering($datastream, $n, $v)
benchmark2 = @benchmark run_smoothing($data, $n, $v)

show(benchmark1)
show(benchmark2)

display(benchmark1)
display(benchmark2)

@maleadt
Member

maleadt commented Jan 23, 2024

On the current master I get significantly worse results: there is a ~15x regression for the second benchmark.

Bisected that to #50927 (cc @vtjnash).

@KristofferC added this to the 1.11 milestone Jan 23, 2024
@bvdmitri
Contributor Author

I reduced the example; it seems the problem is not in the execution of the algorithm but in the model creation. From that perspective, model creation for both models has regressed by a factor of about 20-40.

# 1.10.1

 julia --project=. ~/.julia/dev/julia/script.jl
Trial(21.125 μs)Trial(3.653 ms)BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min  max):  21.125 μs   1.522 ms  ┊ GC (min  max): 0.00%  94.81%
 Time  (median):     23.375 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   24.275 μs ± 29.652 μs  ┊ GC (mean ± σ):  2.37% ±  1.91%

            ▃▃▇▇█▃▃▁
  ▁▁▁▁▁▁▁▁▂▆████████▆▄▄▃▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  21.1 μs         Histogram: frequency by time        30.3 μs <

 Memory estimate: 18.21 KiB, allocs estimate: 433.
BenchmarkTools.Trial: 1274 samples with 1 evaluation.
 Range (min  max):  3.653 ms    7.132 ms  ┊ GC (min  max): 0.00%  36.47%
 Time  (median):     3.721 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.924 ms ± 547.249 μs  ┊ GC (mean ± σ):  3.79% ±  8.58%

  ▂█▅▃▁▁
  ██████▇▆▇▆▅▅▄▅▃▅▅▃▄▃▃▁▄▁▃▃▁▅▃▁▃▃▁▅▅▁▅▅▆▆▅▆▅▆▆▄▆▆▇▇█▆▆▄▅▄▃▁▄ ▇
  3.65 ms      Histogram: log(frequency) by time      5.83 ms <

 Memory estimate: 4.05 MiB, allocs estimate: 103123

master

 ./julia --startup-file=no script.jl
Trial(384.250 μs)Trial(119.892 ms)BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min  max):  384.250 μs   4.570 ms  ┊ GC (min  max): 0.00%  88.93%
 Time  (median):     420.312 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   448.273 μs ± 97.839 μs  ┊ GC (mean ± σ):  0.09% ±  0.89%

    ▃█▇▅▃▇▆▃                                                   ▂
  ▄▄██████████▇▇▆▆▇▇▆▅▆▅▄▅▆▄▅▅▄▄▄▅▄▆▄▄▅▂▅▃▄▄▅▄▄▄▇████▇▇▇▇▇▆▆▆▆ █
  384 μs        Histogram: log(frequency) by time       849 μs <

 Memory estimate: 16.95 KiB, allocs estimate: 412.
BenchmarkTools.Trial: 38 samples with 1 evaluation.
 Range (min  max):  119.892 ms  150.380 ms  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     134.175 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   134.351 ms ±   8.712 ms  ┊ GC (mean ± σ):  0.10% ± 0.66%

                  █            ▃    ▃        █  ▃        ▃
  ▇▇▇▇▇▁▁▁▁▁▇▁▇▇▁▇█▇▇▇▁▇▇▁▁▁▇▁▇█▇▁▁▁█▁▁▁▁▁▇▇▇█▁▁█▇▁▇▁▁▁▁▁█▇▁▁▁▇ ▁
  120 ms           Histogram: frequency by time          150 ms <

 Memory estimate: 3.83 MiB, allocs estimate: 99130.
script.jl
using RxInfer, BenchmarkTools, Random

@model function filtering(; c::ConstVariable, v::ConstVariable)

    x_prior_mean = datavar(Float64)
    x_prior_var  = datavar(Float64)

    x_prior ~ Normal(mean = x_prior_mean, variance = x_prior_var)
    x_next ~ x_prior + c

    y = datavar(Float64)
    y ~ Normal(mean = x_next, variance = v)

end

function run_filtering()
    return RxInfer.create_model(filtering(c = 1.0, v = 1.0))
end

@model function smoothing(n; c::ConstVariable, v::ConstVariable)

    x_prior ~ Normal(mean = 0.0, variance = 10000.0)

    x = randomvar(n)
    y = datavar(Float64, n)

    x_prev = x_prior

    for i in 1:n
        x[i] ~ x_prev + c
        y[i] ~ Normal(mean = x[i], variance = v)
        x_prev = x[i]
    end

    return x, y
end

function run_smoothing(n)
    return RxInfer.create_model(smoothing(n, c = 1.0, v = 1.0))
end

benchmark1 = @benchmark run_filtering()
benchmark2 = @benchmark run_smoothing(500)

show(benchmark1)
show(benchmark2)

display(benchmark1)
display(benchmark2)

@bvdmitri
Contributor Author

I revisited this. RxInfer.jl had a new major release last week (v3.0.0), and the script I provided previously does not work with the new version. Here is the new script:

script.jl
using RxInfer, BenchmarkTools, Random

seed = 123

rng = MersenneTwister(seed)

v = 100.0

n = 500

hidden = collect(1:n)
data = hidden + rand(rng, Normal(0.0, sqrt(v)), n);
datastream = from(data) |> map(NamedTuple{(:y, ), Tuple{Float64}}, (d) -> (y = d, ));

@model function filtering(y, x_prior_mean, x_prior_var, c, v)
    x_prior ~ Normal(mean = x_prior_mean, variance = x_prior_var)
    x_next ~ x_prior + c
    y ~ Normal(mean = x_next, variance = v)
end

@model function smoothing(y, c, v)

    x_prior ~ Normal(mean = 0.0, variance = 10000.0)
    x_prev = x_prior

    for i in eachindex(y)
        x[i] ~ x_prev + c
        y[i] ~ Normal(mean = x[i], variance = v)
        x_prev = x[i]
    end

end

function filtering_model(y)
    return RxInfer.create_model(filtering(c = 1.0, v = 1.0) | (y = y, x_prior_mean = 1.0, x_prior_var = 1.0))
end

function smoothing_model(y)
    return RxInfer.create_model(smoothing(c = 1.0, v = 1.0) | (y = y,))
end

filtering_model(data)
smoothing_model(data)

benchmark1 = @benchmark filtering_model($data)
benchmark2 = @benchmark smoothing_model($data)

println("Filtering model creation")
display(benchmark1)
println("Smoothing model creation")
display(benchmark2)

function run_filtering(datastream, n, v)

    autoupdates = @autoupdates begin
        x_prior_mean, x_prior_var = mean_var(q(x_next))
    end

    init = @initialization begin
        q(x_next) = NormalMeanVariance(0.0, 10000.0)
    end

    return infer(
        model          = filtering(c = 1.0, v = v),
        datastream     = datastream,
        autoupdates    = autoupdates,
        initialization = init,
        keephistory    = n,
        historyvars    = (x_next = KeepLast(), ),
        free_energy    = true,
        autostart      = true,
        postprocess    = UnpackMarginalPostprocess()
    )

end

function run_smoothing(data, n, v)
    return infer(
        model       = smoothing(c = 1.0, v = v),
        data        = (y = data, ),
        returnvars  = KeepLast(),
        free_energy = true,
        postprocess = UnpackMarginalPostprocess()
    )
end

run_filtering(datastream, n, v)
run_smoothing(data, n, v)

benchmark3 = @benchmark run_filtering($datastream, $n, $v)
benchmark4 = @benchmark run_smoothing($data, $n, $v)

println("Filtering inference")
display(benchmark3)
println("Smoothing inference")
display(benchmark4)

The good news is that the performance is actually better on 1.11-beta1 (installed with juliaup); it's about 30% faster in the smoothing regime. I'm not entirely sure whether this is because of the new major version of RxInfer (we refactored a significant part of the codebase) or because it was truly fixed on the Julia side.

Benchmark comparison between `1.10.2` and `1.11-beta1`

Here are the results for 1.10 (installed via juliaup):

 julia +release --startup-file=no --project=julia1.10 julia1.10/script.jl
Filtering model creation
BenchmarkTools.Trial: 10000 samples with 4 evaluations.
 Range (min  max):  7.271 μs  539.958 μs  ┊ GC (min  max): 0.00%  96.43%
 Time  (median):     7.615 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   8.163 μs ±  12.567 μs  ┊ GC (mean ± σ):  6.04% ±  3.94%

            ▅▂▃█▃▁▄
  ▁▁▁▁▁▂▂▄▆▇███████▇▇▄▄▄▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  7.27 μs         Histogram: frequency by time        8.73 μs <

 Memory estimate: 19.81 KiB, allocs estimate: 360.
Smoothing model creation
BenchmarkTools.Trial: 1650 samples with 1 evaluation.
 Range (min  max):  2.752 ms    6.942 ms  ┊ GC (min  max): 0.00%  54.15%
 Time  (median):     2.836 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.031 ms ± 509.820 μs  ┊ GC (mean ± σ):  5.80% ± 10.60%

  ▃▇█▆▄▁
  ███████▇▆▄▅▆▄▄▄▄▁▁▁▄▄▁▁▁▁▁▄▄▁▄▅▄▁▄▄▇▆▆▇▇█▇▇▆▇█▇▇██▇█▁▇▆▅▆▅▆ █
  2.75 ms      Histogram: log(frequency) by time      4.69 ms <

 Memory estimate: 5.69 MiB, allocs estimate: 114772.
Filtering inference
BenchmarkTools.Trial: 2588 samples with 1 evaluation.
 Range (min  max):  1.848 ms    3.529 ms  ┊ GC (min  max): 0.00%  44.64%
 Time  (median):     1.886 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.931 ms ± 238.478 μs  ┊ GC (mean ± σ):  1.94% ±  6.81%

  ▇█▆▅
  █████▇▆▄▄▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▆█▇ █
  1.85 ms      Histogram: log(frequency) by time       3.4 ms <

 Memory estimate: 1.12 MiB, allocs estimate: 32633.
Smoothing inference
BenchmarkTools.Trial: 175 samples with 1 evaluation.
 Range (min  max):  24.887 ms  104.396 ms  ┊ GC (min  max):  0.00%  75.94%
 Time  (median):     27.753 ms               ┊ GC (median):     9.27%
 Time  (mean ± σ):   28.711 ms ±   8.320 ms  ┊ GC (mean ± σ):  11.20% ± 10.16%

    ▃█▇         ▄▁▁
  ▃▇████▃▃▁▃▁▄▅▄███▅▅▃▃▃▅▄▃▅▃▄▄▃▅▄▇▅▄▄▁▁▃▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃ ▃
  24.9 ms         Histogram: frequency by time         36.5 ms <

 Memory estimate: 26.59 MiB, allocs estimate: 613635.

And here are the results for 1.11-beta1 (installed via juliaup):

 julia +1.11 --startup-file=no --project=julia1.11 julia1.11/script.jl
Filtering model creation
BenchmarkTools.Trial: 10000 samples with 5 evaluations.
 Range (min  max):  6.808 μs   1.758 ms  ┊ GC (min  max): 0.00%  98.98%
 Time  (median):     7.133 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   8.006 μs ± 24.727 μs  ┊ GC (mean ± σ):  8.76% ±  3.37%

        ▅▆▇█▁
  ▁▁▂▃▆██████▆▄▄▃▃▂▂▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  6.81 μs        Histogram: frequency by time        8.79 μs <

 Memory estimate: 19.81 KiB, allocs estimate: 413.
Smoothing model creation
BenchmarkTools.Trial: 1632 samples with 1 evaluation.
 Range (min  max):  2.770 ms    7.357 ms  ┊ GC (min  max): 0.00%  58.78%
 Time  (median):     2.843 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.064 ms ± 608.889 μs  ┊ GC (mean ± σ):  6.98% ± 12.24%

  ▅█▇▂
  █████▇▆▁▄▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▃▃▁▄▃▅▅▄▅▆▇▆▆▅▆▅█▆█▇▇█▆▆▆▇▄▇▅▆▅▅▆ █
  2.77 ms      Histogram: log(frequency) by time      5.09 ms <

 Memory estimate: 5.75 MiB, allocs estimate: 125717.
Filtering inference
BenchmarkTools.Trial: 3090 samples with 1 evaluation.
 Range (min  max):  1.506 ms   14.943 ms  ┊ GC (min  max): 0.00%  88.95%
 Time  (median):     1.568 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.618 ms ± 620.070 μs  ┊ GC (mean ± σ):  3.08% ±  6.71%

      ▁▂▄▆▅▅▄▂▁▂▂▃▄▄▇█▇▅▃
  ▂▃▃▅████████████████████▇▅▄▄▄▄▄▄▃▃▃▃▃▂▂▂▂▁▁▂▂▂▂▂▂▁▂▂▁▂▂▂▂▂▂ ▄
  1.51 ms         Histogram: frequency by time        1.73 ms <

 Memory estimate: 1.07 MiB, allocs estimate: 30647.
Smoothing inference
BenchmarkTools.Trial: 227 samples with 1 evaluation.
 Range (min  max):  20.095 ms  39.775 ms  ┊ GC (min  max): 0.00%  45.73%
 Time  (median):     20.692 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   22.080 ms ±  3.530 ms  ┊ GC (mean ± σ):  6.15% ± 10.73%

  ▃▇█▅
  █████▆▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆▄▁▆▄▇▁▆▁▄▆▄▆▁▆▆▁▆▁▆▁▄▁▁▁▁▁▁▁▁▁▁▄ ▆
  20.1 ms      Histogram: log(frequency) by time      34.8 ms <

 Memory estimate: 23.68 MiB, allocs estimate: 595580.

I've also checked the recent master (compiled from source with make all). The results are pretty similar to 1.11-beta1.

Benchmark comparison between `1.11-beta1` and master (commit `53f452a7216d5e8af9ccadb4c8a86bcf61466a87`)

Here are the results for `1.11-beta1` (installed via `juliaup`):
 julia +1.11 --startup-file=no --project=julia1.11 julia1.11/script.jl
Filtering model creation
BenchmarkTools.Trial: 10000 samples with 5 evaluations.
 Range (min  max):  6.808 μs   1.758 ms  ┊ GC (min  max): 0.00%  98.98%
 Time  (median):     7.133 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   8.006 μs ± 24.727 μs  ┊ GC (mean ± σ):  8.76% ±  3.37%

        ▅▆▇█▁
  ▁▁▂▃▆██████▆▄▄▃▃▂▂▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  6.81 μs        Histogram: frequency by time        8.79 μs <

 Memory estimate: 19.81 KiB, allocs estimate: 413.
Smoothing model creation
BenchmarkTools.Trial: 1632 samples with 1 evaluation.
 Range (min  max):  2.770 ms    7.357 ms  ┊ GC (min  max): 0.00%  58.78%
 Time  (median):     2.843 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.064 ms ± 608.889 μs  ┊ GC (mean ± σ):  6.98% ± 12.24%

  ▅█▇▂
  █████▇▆▁▄▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▃▃▁▄▃▅▅▄▅▆▇▆▆▅▆▅█▆█▇▇█▆▆▆▇▄▇▅▆▅▅▆ █
  2.77 ms      Histogram: log(frequency) by time      5.09 ms <

 Memory estimate: 5.75 MiB, allocs estimate: 125717.
Filtering inference
BenchmarkTools.Trial: 3090 samples with 1 evaluation.
 Range (min  max):  1.506 ms   14.943 ms  ┊ GC (min  max): 0.00%  88.95%
 Time  (median):     1.568 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.618 ms ± 620.070 μs  ┊ GC (mean ± σ):  3.08% ±  6.71%

      ▁▂▄▆▅▅▄▂▁▂▂▃▄▄▇█▇▅▃
  ▂▃▃▅████████████████████▇▅▄▄▄▄▄▄▃▃▃▃▃▂▂▂▂▁▁▂▂▂▂▂▂▁▂▂▁▂▂▂▂▂▂ ▄
  1.51 ms         Histogram: frequency by time        1.73 ms <

 Memory estimate: 1.07 MiB, allocs estimate: 30647.
Smoothing inference
BenchmarkTools.Trial: 227 samples with 1 evaluation.
 Range (min  max):  20.095 ms  39.775 ms  ┊ GC (min  max): 0.00%  45.73%
 Time  (median):     20.692 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   22.080 ms ±  3.530 ms  ┊ GC (mean ± σ):  6.15% ± 10.73%

  ▃▇█▅
  █████▆▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆▄▁▆▄▇▁▆▁▄▆▄▆▁▆▆▁▆▁▆▁▄▁▁▁▁▁▁▁▁▁▁▄ ▆
  20.1 ms      Histogram: log(frequency) by time      34.8 ms <

 Memory estimate: 23.68 MiB, allocs estimate: 595580.

And here are the results for master (commit 53f452a):

 ~/Projects/Julia/julia/julia --startup-file=no julia1.12/script.jl
Filtering model creation
BenchmarkTools.Trial: 10000 samples with 4 evaluations.
 Range (min  max):  6.906 μs   1.597 ms  ┊ GC (min  max): 0.00%  98.97%
 Time  (median):     7.260 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   8.114 μs ± 28.223 μs  ┊ GC (mean ± σ):  8.69% ±  2.60%

       ▁▅▇██▆▃▂▁
  ▁▁▂▄▇█████████▆▅▄▄▃▄▃▃▃▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  6.91 μs        Histogram: frequency by time        8.75 μs <

 Memory estimate: 19.81 KiB, allocs estimate: 413.
Smoothing model creation
BenchmarkTools.Trial: 1604 samples with 1 evaluation.
 Range (min  max):  2.802 ms    8.084 ms  ┊ GC (min  max): 0.00%  62.53%
 Time  (median):     2.880 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.118 ms ± 670.430 μs  ┊ GC (mean ± σ):  7.31% ± 12.73%

  ▅█▆▂▁
  ██████▆▄▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▅▁▁▄▄▅▇▄▆▆▆▆▇▆▇▇▇▆▆▆▇▆▆▆█▇▆▅▄▅ █
  2.8 ms       Histogram: log(frequency) by time      5.35 ms <

 Memory estimate: 5.75 MiB, allocs estimate: 125587.
Filtering inference
BenchmarkTools.Trial: 3092 samples with 1 evaluation.
 Range (min  max):  1.528 ms   15.176 ms  ┊ GC (min  max): 0.00%  89.22%
 Time  (median):     1.559 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.616 ms ± 637.183 μs  ┊ GC (mean ± σ):  3.03% ±  6.47%

      ▂▃▆██▆▅▄
  ▂▃▄▇████████▇▇▅▅▃▃▃▃▄▄▄▅▅▅▄▃▃▃▂▂▂▂▂▂▂▂▁▁▁▁▂▁▂▁▁▂▂▁▁▁▂▂▂▂▂▁▂ ▃
  1.53 ms         Histogram: frequency by time        1.72 ms <

 Memory estimate: 1.07 MiB, allocs estimate: 30647.
Smoothing inference
BenchmarkTools.Trial: 228 samples with 1 evaluation.
 Range (min  max):  19.823 ms  41.105 ms  ┊ GC (min  max): 0.00%  48.46%
 Time  (median):     20.509 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   21.933 ms ±  3.727 ms  ┊ GC (mean ± σ):  6.33% ± 11.13%

  ▁▅█▇▄
  █████▆▆▁▁▄▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▄▄▆▆▁▁▆▆▁▆▆▄▆▁▁▆▄▄▁▆▁▁▆▁▁▁▁▆ ▆
  19.8 ms      Histogram: log(frequency) by time      33.4 ms <

 Memory estimate: 23.67 MiB, allocs estimate: 595451.

I think the issue can be closed.
