
question on benchmarks for array_of_struct #414

Closed
aktau opened this issue Apr 22, 2024 · 6 comments · Fixed by #462
Labels: performance (Performance-related stuff), question (Further information is requested), tests (Tests, benchmarks and profiling)

Comments


aktau commented Apr 22, 2024

The graph in the README, with arche outperforming a bog-standard AoS setup, really surprised me. The memory locality in the AoS case should be really good. To verify, I ran the benchmarks and found something more in line with my expectations:

$ benchstat -col '/impl' ~/tmp/bench.arche
goarch: amd64
pkg: github.com/mlange-42/arche/benchmark/competition/array_of_structs
cpu: AMD Ryzen Threadripper PRO 3995WX 64-Cores
                         │     Arche     │            ArrOfPointers             │            ArrOfStructs             │              LinkedList               │
                         │    sec/op     │    sec/op      vs base               │    sec/op     vs base               │    sec/op      vs base                │
Fill/conf=16B_1_000-8      3301.0n ± ∞ ¹    503.0n ± ∞ ¹  -84.76% (p=0.008 n=5)   510.9n ± ∞ ¹  -84.52% (p=0.008 n=5)   1252.0n ± ∞ ¹   -62.07% (p=0.008 n=5)
Fill/conf=16B_10_000-8     32.415µ ± ∞ ¹   18.993µ ± ∞ ¹  -41.41% (p=0.008 n=5)   5.824µ ± ∞ ¹  -82.03% (p=0.008 n=5)   12.868µ ± ∞ ¹   -60.30% (p=0.008 n=5)
Fill/conf=16B_100_000-8    322.92µ ± ∞ ¹    55.82µ ± ∞ ¹  -82.71% (p=0.008 n=5)   58.83µ ± ∞ ¹  -81.78% (p=0.008 n=5)   130.03µ ± ∞ ¹   -59.73% (p=0.008 n=5)
Fill/conf=32B_1_000-8      3360.0n ± ∞ ¹    661.1n ± ∞ ¹  -80.32% (p=0.008 n=5)   557.0n ± ∞ ¹  -83.42% (p=0.008 n=5)   1365.0n ± ∞ ¹   -59.38% (p=0.008 n=5)
Fill/conf=32B_10_000-8     32.258µ ± ∞ ¹    6.797µ ± ∞ ¹  -78.93% (p=0.008 n=5)   6.585µ ± ∞ ¹  -79.59% (p=0.008 n=5)   14.178µ ± ∞ ¹   -56.05% (p=0.008 n=5)
Fill/conf=32B_100_000-8    319.83µ ± ∞ ¹    88.52µ ± ∞ ¹  -72.32% (p=0.008 n=5)   67.72µ ± ∞ ¹  -78.83% (p=0.008 n=5)   164.23µ ± ∞ ¹   -48.65% (p=0.008 n=5)
Fill/conf=64B_1_000-8      3407.0n ± ∞ ¹    877.6n ± ∞ ¹  -74.24% (p=0.008 n=5)   692.7n ± ∞ ¹  -79.67% (p=0.008 n=5)   1779.0n ± ∞ ¹   -47.78% (p=0.008 n=5)
Fill/conf=64B_10_000-8     32.483µ ± ∞ ¹    9.963µ ± ∞ ¹  -69.33% (p=0.008 n=5)   7.493µ ± ∞ ¹  -76.93% (p=0.008 n=5)   19.404µ ± ∞ ¹   -40.26% (p=0.008 n=5)
Fill/conf=64B_100_000-8    322.95µ ± ∞ ¹   106.15µ ± ∞ ¹  -67.13% (p=0.008 n=5)   84.89µ ± ∞ ¹  -73.72% (p=0.008 n=5)   204.36µ ± ∞ ¹   -36.72% (p=0.008 n=5)
Fill/conf=128B_1_000-8     3552.0n ± ∞ ¹    864.6n ± ∞ ¹  -75.66% (p=0.008 n=5)   667.0n ± ∞ ¹  -81.22% (p=0.008 n=5)   1818.0n ± ∞ ¹   -48.82% (p=0.008 n=5)
Fill/conf=128B_10_000-8     32.32µ ± ∞ ¹    11.96µ ± ∞ ¹  -63.00% (p=0.008 n=5)   10.64µ ± ∞ ¹  -67.07% (p=0.008 n=5)    26.00µ ± ∞ ¹   -19.55% (p=0.008 n=5)
Fill/conf=128B_100_000-8    319.7µ ± ∞ ¹    251.0µ ± ∞ ¹  -21.50% (p=0.008 n=5)   176.5µ ± ∞ ¹  -44.79% (p=0.008 n=5)    528.7µ ± ∞ ¹   +65.38% (p=0.008 n=5)
Fill/conf=256B_1_000-8     3459.0n ± ∞ ¹    892.2n ± ∞ ¹  -74.21% (p=0.008 n=5)   667.8n ± ∞ ¹  -80.69% (p=0.008 n=5)   1951.0n ± ∞ ¹   -43.60% (p=0.008 n=5)
Fill/conf=256B_10_000-8    32.567µ ± ∞ ¹   11.829µ ± ∞ ¹  -63.68% (p=0.008 n=5)   8.452µ ± ∞ ¹  -74.05% (p=0.008 n=5)   32.826µ ± ∞ ¹         ~ (p=0.151 n=5)
Fill/conf=256B_100_000-8    323.4µ ± ∞ ¹    610.1µ ± ∞ ¹  +88.64% (p=0.008 n=5)   574.7µ ± ∞ ¹  +77.68% (p=0.008 n=5)   1334.7µ ± ∞ ¹  +312.66% (p=0.008 n=5)
geomean                     32.90µ          10.83µ        -67.09%                 8.435µ        -74.36%                  21.55µ         -34.50%

Of note: I manually changed the benchmark names with a text editor so I could use benchstat's column projection.

What did I do wrong?

mlange-42 (Owner) commented

Hi @aktau,
many thanks for your interest!

Honestly, I don't know what you "did wrong", as I don't know your code. Can you reproduce the original benchmarks? Maybe we should turn the question around: what do you think I did wrong?

Regarding cache locality of AoS, I think it is not that clear-cut. Actually, the benchmarks are intended to show that the cache friendliness of AoS decreases with increasing entity size while always accessing the same small amount of data. With AoS, the entire entity data needs to be loaded into the cache, no matter how much of it is actually used. With large entities, fewer fit into the cache and more frequent access to slower cache levels is required. So I think the entity-size dimension of the problem is quite clear.
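A minimal Go sketch of the two access patterns (illustrative names only; `LargeEntity`, `updateAoS`, and `updateSoA` are not from the actual benchmark code):

```go
package sketch

// LargeEntity models an AoS entity: 16 B of "hot" data plus padding
// up to the configured entity size (16 B ... 256 B in the benchmarks).
type LargeEntity struct {
	X, Y    float64   // the only fields the loop touches
	Payload [240]byte // dead weight that still travels through the cache
}

// AoS: every iteration pulls the whole entity (here 256 B, i.e. four
// 64 B cache lines) through the cache to use 16 B of it.
func updateAoS(es []LargeEntity) {
	for i := range es {
		es[i].X++
		es[i].Y++
	}
}

// SoA-like: the hot fields live in dense arrays of their own, so each
// cache line carries only data the loop actually uses.
func updateSoA(xs, ys []float64) {
	for i := range xs {
		xs[i]++
		ys[i]++
	}
}
```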

What is less clear to me is why there is no saturation effect with an increasing number of entities. Well, maybe there is, but only at even larger numbers.

I also did the same benchmarks for a Rust ECS (rs-ecs), and qualitatively the results are the same.

Could you share your benchmarking code for a comparison?


aktau commented Apr 22, 2024

> Honestly, I don't know what you "did wrong", as I don't know your code. Can you reproduce the original benchmarks?

I'm just running your code, unedited:

$ cd github/arche/benchmark/competition/array_of_structs
$ go test -test.bench=. -test.count=5 | tee ~/tmp/bench.arche
$ nvim ~/tmp/bench.arche # edit the benchmark names so that I can use the `-col '/impl'` flag, as stated above
$ benchstat -col '/impl' ~/tmp/bench.arche
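
(The manual edit amounts to something like this hypothetical sed one-liner; the exact pattern isn't important, only that the names end up in benchstat's `key=value` sub-benchmark convention:)

$ sed -E -i 's|^Benchmark([A-Za-z]+)_([0-9]+B_[0-9_]+)|BenchmarkFill/impl=\1/conf=\2|' ~/tmp/bench.arche
# e.g. BenchmarkArche_16B_1_000-8 -> BenchmarkFill/impl=Arche/conf=16B_1_000-8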

> Maybe we should turn the question around: what do you think I did wrong?

I don't think you did anything wrong. I was just surprised by the result.

> Regarding cache locality of AoS, I think it is not that clear-cut. Actually, the benchmarks are intended to show that the cache friendliness of AoS decreases with increasing entity size while always accessing the same small amount of data. With AoS, the entire entity data needs to be loaded into the cache, no matter how much of it is actually used. With large entities, fewer fit into the cache and more frequent access to slower cache levels is required. So I think the entity-size dimension of the problem is quite clear.

So if I understand it right, arche uses an SoA pattern and the loops only access a small part of the struct, which is why it can pull ahead. Is that correct?

mlange-42 (Owner) commented

> I'm just running your code, unedited:

Can you verify that the original output already shows what the benchstat table does?

> So if I understand it right, arche uses an SoA pattern and the loops only access a small part of the struct, which is why it can pull ahead. Is that correct?

Yes, kind of. Most ECS implementations (archetype-based like Arche, but also sparse-set-based) use a memory layout more similar to SoA. However, the arrays (or buffers, or whatever) do not necessarily contain only primitive types, but components. A component can contain multiple variables/primitives, but for good performance they should be closely related and mostly accessed together.
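
In sketch form (illustrative Go, not Arche's actual internals; `archetype`, `Position`, and `Velocity` are made up for the example):

```go
package sketch

// Components: small groups of closely related fields that are
// usually accessed together.
type Position struct{ X, Y float64 }
type Velocity struct{ DX, DY float64 }

// An archetype stores one dense column per component type, for all
// entities sharing the same component combination.
type archetype struct {
	positions  []Position
	velocities []Velocity // same length, same entity order
}

// A query over (Position, Velocity) walks the columns in lockstep,
// touching only the component data it asked for.
func (a *archetype) integrate(dt float64) {
	for i := range a.positions {
		a.positions[i].X += a.velocities[i].DX * dt
		a.positions[i].Y += a.velocities[i].DY * dt
	}
}
```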

You may want to take a look at the architecture section of the user manual for more details.


aktau commented Apr 30, 2024

> Can you verify that the original output already shows what the benchstat table does?

It does. I did not change the numbers, only the output format, so that benchstat can render/compare them better (see the `-col` flag).

> So if I understand it right, arche uses an SoA pattern and the loops only access a small part of the struct, which is why it can pull ahead. Is that correct?

> Yes, kind of.

I think that can explain the difference in performance. For very big structs, the SoA-like approach of arche would be superior, even taking the extra abstraction into account.
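Back-of-envelope, assuming 64 B cache lines and only ~16 B of each entity actually read per iteration: a 256 B AoS entity costs four full cache lines, while an SoA layout costs about a quarter of a line, roughly a 16x difference in memory traffic once the working set no longer fits in cache.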

mlange-42 (Owner) commented

I would still like to find out why your results differ that much. Could you share the raw results from your machine?

mlange-42 added the question, performance, and tests labels on May 2, 2024
mlange-42 (Owner) commented

@aktau It looks like there are huge differences between machines. On my local Windows machine, I can still reproduce the results shown in the README. In the CI, on the other hand, we get something closer to your results. An excerpt is shown at the bottom.

I guess the primary difference is the 256MB cache of the CI machine, compared to 8MB locally.
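
Back-of-envelope: 100,000 entities at 256 B each is about 25.6 MB of AoS data, far beyond an 8 MB L3 but comfortably inside 256 MB. That would explain why the CI machine (and presumably your Threadripper, with its similarly large cache) behaves so differently at the largest sizes.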

Locally:

cpu: Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
BenchmarkArche_16B_1_000-8                        518306              2106 ns/op               0 B/op          0 allocs/op
BenchmarkArche_16B_10_000-8                        56714             21071 ns/op               0 B/op          0 allocs/op
BenchmarkArche_16B_100_000-8                        5271            214584 ns/op               0 B/op          0 allocs/op
...
BenchmarkArche_256B_1_000-8                       526672              2184 ns/op               0 B/op          0 allocs/op
BenchmarkArche_256B_10_000-8                       55582             21066 ns/op               0 B/op          0 allocs/op
BenchmarkArche_256B_100_000-8                       5295            213346 ns/op               0 B/op          0 allocs/op
...
BenchmarkArrOfStructs_16B_1_000-8                1814610               659.8 ns/op             0 B/op          0 allocs/op
BenchmarkArrOfStructs_16B_10_000-8                176470              6687 ns/op               0 B/op          0 allocs/op
BenchmarkArrOfStructs_16B_100_000-8                16658             71902 ns/op               0 B/op          0 allocs/op
...
BenchmarkArrOfStructs_256B_1_000-8                890492              1366 ns/op               0 B/op          0 allocs/op
BenchmarkArrOfStructs_256B_10_000-8                38503             30776 ns/op               0 B/op          0 allocs/op
BenchmarkArrOfStructs_256B_100_000-8                1138           1095108 ns/op               0 B/op          0 allocs/op

CI:

cpu: AMD EPYC 7763 64-Core Processor                
BenchmarkArche_16B_1_000-4              	  398088	      2922 ns/op	       0 B/op	       0 allocs/op
BenchmarkArche_16B_10_000-4             	   41156	     28951 ns/op	       0 B/op	       0 allocs/op
BenchmarkArche_16B_100_000-4            	    4070	    291280 ns/op	       0 B/op	       0 allocs/op
...
BenchmarkArche_256B_1_000-4             	  409652	      2848 ns/op	       0 B/op	       0 allocs/op
BenchmarkArche_256B_10_000-4            	   41116	     29150 ns/op	       0 B/op	       0 allocs/op
BenchmarkArche_256B_100_000-4           	    4096	    292116 ns/op	       0 B/op	       0 allocs/op
...
BenchmarkArrOfStructs_16B_1_000-4       	 1913967	       625.7 ns/op	       0 B/op	       0 allocs/op
BenchmarkArrOfStructs_16B_10_000-4      	  193038	      6250 ns/op	       0 B/op	       0 allocs/op
BenchmarkArrOfStructs_16B_100_000-4     	   19146	     62600 ns/op	       0 B/op	       0 allocs/op
...
BenchmarkArrOfStructs_256B_1_000-4      	 1871191	       640.7 ns/op	       0 B/op	       0 allocs/op
BenchmarkArrOfStructs_256B_10_000-4     	   70363	     16944 ns/op	       0 B/op	       0 allocs/op
BenchmarkArrOfStructs_256B_100_000-4    	    5292	    240630 ns/op	       0 B/op	       0 allocs/op
