
question on benchmarks for array_of_struct #414

Closed
aktau opened this issue Apr 22, 2024 · 6 comments · Fixed by #462
Labels: performance (Performance-related stuff), question (Further information is requested), tests (Tests, benchmarks and profiling)

Comments


aktau commented Apr 22, 2024

The graph in the README, with arche outperforming a bog-standard AoS setup, really surprised me. The memory locality in the AoS case should be really good. To verify, I ran the benchmarks and found something more in line with my expectations:

$ benchstat -col '/impl' ~/tmp/bench.arche
goarch: amd64
pkg: github.com/mlange-42/arche/benchmark/competition/array_of_structs
cpu: AMD Ryzen Threadripper PRO 3995WX 64-Cores
                         │     Arche     │            ArrOfPointers             │            ArrOfStructs             │              LinkedList               │
                         │    sec/op     │    sec/op      vs base               │    sec/op     vs base               │    sec/op      vs base                │
Fill/conf=16B_1_000-8      3301.0n ± ∞ ¹    503.0n ± ∞ ¹  -84.76% (p=0.008 n=5)   510.9n ± ∞ ¹  -84.52% (p=0.008 n=5)   1252.0n ± ∞ ¹   -62.07% (p=0.008 n=5)
Fill/conf=16B_10_000-8     32.415µ ± ∞ ¹   18.993µ ± ∞ ¹  -41.41% (p=0.008 n=5)   5.824µ ± ∞ ¹  -82.03% (p=0.008 n=5)   12.868µ ± ∞ ¹   -60.30% (p=0.008 n=5)
Fill/conf=16B_100_000-8    322.92µ ± ∞ ¹    55.82µ ± ∞ ¹  -82.71% (p=0.008 n=5)   58.83µ ± ∞ ¹  -81.78% (p=0.008 n=5)   130.03µ ± ∞ ¹   -59.73% (p=0.008 n=5)
Fill/conf=32B_1_000-8      3360.0n ± ∞ ¹    661.1n ± ∞ ¹  -80.32% (p=0.008 n=5)   557.0n ± ∞ ¹  -83.42% (p=0.008 n=5)   1365.0n ± ∞ ¹   -59.38% (p=0.008 n=5)
Fill/conf=32B_10_000-8     32.258µ ± ∞ ¹    6.797µ ± ∞ ¹  -78.93% (p=0.008 n=5)   6.585µ ± ∞ ¹  -79.59% (p=0.008 n=5)   14.178µ ± ∞ ¹   -56.05% (p=0.008 n=5)
Fill/conf=32B_100_000-8    319.83µ ± ∞ ¹    88.52µ ± ∞ ¹  -72.32% (p=0.008 n=5)   67.72µ ± ∞ ¹  -78.83% (p=0.008 n=5)   164.23µ ± ∞ ¹   -48.65% (p=0.008 n=5)
Fill/conf=64B_1_000-8      3407.0n ± ∞ ¹    877.6n ± ∞ ¹  -74.24% (p=0.008 n=5)   692.7n ± ∞ ¹  -79.67% (p=0.008 n=5)   1779.0n ± ∞ ¹   -47.78% (p=0.008 n=5)
Fill/conf=64B_10_000-8     32.483µ ± ∞ ¹    9.963µ ± ∞ ¹  -69.33% (p=0.008 n=5)   7.493µ ± ∞ ¹  -76.93% (p=0.008 n=5)   19.404µ ± ∞ ¹   -40.26% (p=0.008 n=5)
Fill/conf=64B_100_000-8    322.95µ ± ∞ ¹   106.15µ ± ∞ ¹  -67.13% (p=0.008 n=5)   84.89µ ± ∞ ¹  -73.72% (p=0.008 n=5)   204.36µ ± ∞ ¹   -36.72% (p=0.008 n=5)
Fill/conf=128B_1_000-8     3552.0n ± ∞ ¹    864.6n ± ∞ ¹  -75.66% (p=0.008 n=5)   667.0n ± ∞ ¹  -81.22% (p=0.008 n=5)   1818.0n ± ∞ ¹   -48.82% (p=0.008 n=5)
Fill/conf=128B_10_000-8     32.32µ ± ∞ ¹    11.96µ ± ∞ ¹  -63.00% (p=0.008 n=5)   10.64µ ± ∞ ¹  -67.07% (p=0.008 n=5)    26.00µ ± ∞ ¹   -19.55% (p=0.008 n=5)
Fill/conf=128B_100_000-8    319.7µ ± ∞ ¹    251.0µ ± ∞ ¹  -21.50% (p=0.008 n=5)   176.5µ ± ∞ ¹  -44.79% (p=0.008 n=5)    528.7µ ± ∞ ¹   +65.38% (p=0.008 n=5)
Fill/conf=256B_1_000-8     3459.0n ± ∞ ¹    892.2n ± ∞ ¹  -74.21% (p=0.008 n=5)   667.8n ± ∞ ¹  -80.69% (p=0.008 n=5)   1951.0n ± ∞ ¹   -43.60% (p=0.008 n=5)
Fill/conf=256B_10_000-8    32.567µ ± ∞ ¹   11.829µ ± ∞ ¹  -63.68% (p=0.008 n=5)   8.452µ ± ∞ ¹  -74.05% (p=0.008 n=5)   32.826µ ± ∞ ¹         ~ (p=0.151 n=5)
Fill/conf=256B_100_000-8    323.4µ ± ∞ ¹    610.1µ ± ∞ ¹  +88.64% (p=0.008 n=5)   574.7µ ± ∞ ¹  +77.68% (p=0.008 n=5)   1334.7µ ± ∞ ¹  +312.66% (p=0.008 n=5)
geomean                     32.90µ          10.83µ        -67.09%                 8.435µ        -74.36%                  21.55µ         -34.50%

Of note: I manually changed the benchmark names with a text editor so I could use benchstat's column projection.

What did I do wrong?

mlange-42 (Owner) commented

Hi @aktau,
many thanks for your interest!

Honestly, I don't know what you "did wrong", as I don't know your code. Can you reproduce the original benchmarks? Maybe we should turn the question around: what do you think I did wrong?

Regarding cache locality of AoS, I think it is not that clear-cut. Actually, the benchmarks are intended to show that the cache friendliness of AoS decreases with increasing entity size while always accessing the same small amount of data. With AoS, the entire entity data needs to be loaded into the cache, no matter how much of it is actually used. With large entities, fewer fit into the cache and more frequent access to slower cache levels is required. So I think the entity-size dimension of the problem is quite clear.
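A minimal Go sketch of the two access patterns (illustrative names only; `LargeEntity`, `updateAoS`, and `updateSoA` are not from the actual benchmark code):

```go
package sketch

// LargeEntity models an AoS entity: 16 B of "hot" data plus padding
// up to the configured entity size (16 B ... 256 B in the benchmarks).
type LargeEntity struct {
	X, Y    float64   // the only fields the loop touches
	Payload [240]byte // dead weight that still travels through the cache
}

// AoS: every iteration pulls the whole entity (here 256 B, i.e. four
// 64 B cache lines) through the cache to use 16 B of it.
func updateAoS(es []LargeEntity) {
	for i := range es {
		es[i].X++
		es[i].Y++
	}
}

// SoA-like: the hot fields live in dense arrays of their own, so each
// cache line carries only data the loop actually uses.
func updateSoA(xs, ys []float64) {
	for i := range xs {
		xs[i]++
		ys[i]++
	}
}
```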

What is less clear to me is why there is no saturation effect with an increasing number of entities. Well, maybe there is, but only at even larger numbers.

I also did the same benchmarks for a Rust ECS (rs-ecs), and qualitatively the results are the same.

Could you share your benchmarking code for a comparison?


aktau commented Apr 22, 2024

> Honestly, I don't know what you "did wrong", as I don't know your code. Can you reproduce the original benchmarks?

I'm just running your code, unedited:

$ cd github/arche/benchmark/competition/array_of_structs
$ go test -test.bench=. -test.count=5 | tee ~/tmp/bench.arche
$ nvim ~/tmp/bench.arche # edit the benchmark names so that I can use the `-col '/impl'` flag, as stated above
$ benchstat -col '/impl' ~/tmp/bench.arche
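
(The manual edit amounts to something like this hypothetical sed one-liner; the exact pattern isn't important, only that the names end up in benchstat's `key=value` sub-benchmark convention:)

$ sed -E -i 's|^Benchmark([A-Za-z]+)_([0-9]+B_[0-9_]+)|BenchmarkFill/impl=\1/conf=\2|' ~/tmp/bench.arche
# e.g. BenchmarkArche_16B_1_000-8 -> BenchmarkFill/impl=Arche/conf=16B_1_000-8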

> Maybe we should turn the question around: what do you think I did wrong?

I don't think you did anything wrong. I was just surprised by the result.

> Regarding cache locality of AoS, I think it is not that clear-cut. Actually, the benchmarks are intended to show that the cache friendliness of AoS decreases with increasing entity size while always accessing the same small amount of data. With AoS, the entire entity data needs to be loaded into the cache, no matter how much of it is actually used. With large entities, fewer fit into the cache and more frequent access to slower cache levels is required. So I think the entity-size dimension of the problem is quite clear.

So if I understand it right, arche uses an SoA pattern and the loops only access a small part of the struct, which is why it can pull ahead. Is that correct?

mlange-42 (Owner) commented

> I'm just running your code, unedited:

Can you verify that the original output already shows what the benchstat table does?

> So if I understand it right, arche uses an SoA pattern and the loops only access a small part of the struct, which is why it can pull ahead. Is that correct?

Yes, kind of. Most ECS implementations (archetype-based like Arche, but also sparse-set-based) use a memory layout more similar to SoA. However, the arrays (or buffers, or whatever) do not necessarily contain only primitive types, but components. A component can contain multiple variables/primitives, but for good performance they should be closely related and mostly accessed together.
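
In sketch form (illustrative Go, not Arche's actual internals; `archetype`, `Position`, and `Velocity` are made up for the example):

```go
package sketch

// Components: small groups of closely related fields that are
// usually accessed together.
type Position struct{ X, Y float64 }
type Velocity struct{ DX, DY float64 }

// An archetype stores one dense column per component type, for all
// entities sharing the same component combination.
type archetype struct {
	positions  []Position
	velocities []Velocity // same length, same entity order
}

// A query over (Position, Velocity) walks the columns in lockstep,
// touching only the component data it asked for.
func (a *archetype) integrate(dt float64) {
	for i := range a.positions {
		a.positions[i].X += a.velocities[i].DX * dt
		a.positions[i].Y += a.velocities[i].DY * dt
	}
}
```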

You may want to take a look at the architecture section of the user manual for more details.


aktau commented Apr 30, 2024

> Can you verify that the original output already shows what the benchstat table does?

It does. I did not change the numbers, only the output format, so that benchstat can render/compare them better (see the `-col` flag).

> So if I understand it right, arche uses an SoA pattern and the loops only access a small part of the struct, which is why it can pull ahead. Is that correct?

> Yes, kind of.

I think that can explain the difference in performance. For very big structs, the SoA-like approach of arche would be superior, even taking the extra abstraction into account.
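Back-of-envelope, assuming 64 B cache lines and only ~16 B of each entity actually read per iteration: a 256 B AoS entity costs four full cache lines, while an SoA layout costs about a quarter of a line, roughly a 16x difference in memory traffic once the working set no longer fits in cache.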

mlange-42 (Owner) commented

I would still like to find out why your results differ that much. Could you share the raw results from your machine?

mlange-42 added the question, performance, and tests labels on May 2, 2024
mlange-42 (Owner) commented

@aktau It looks like there are huge differences between machines. On my local Windows machine, I can still reproduce the results shown in the README. In the CI, on the other hand, we get something closer to your results. An excerpt is shown at the bottom.

I guess the primary difference is the 256MB cache of the CI machine, compared to 8MB locally.
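
Back-of-envelope: 100,000 entities at 256 B each is about 25.6 MB of AoS data, far beyond an 8 MB L3 but comfortably inside 256 MB. That would explain why the CI machine (and presumably your Threadripper, with its similarly large cache) behaves so differently at the largest sizes.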

Locally:

cpu: Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
BenchmarkArche_16B_1_000-8                        518306              2106 ns/op               0 B/op          0 allocs/op
BenchmarkArche_16B_10_000-8                        56714             21071 ns/op               0 B/op          0 allocs/op
BenchmarkArche_16B_100_000-8                        5271            214584 ns/op               0 B/op          0 allocs/op
...
BenchmarkArche_256B_1_000-8                       526672              2184 ns/op               0 B/op          0 allocs/op
BenchmarkArche_256B_10_000-8                       55582             21066 ns/op               0 B/op          0 allocs/op
BenchmarkArche_256B_100_000-8                       5295            213346 ns/op               0 B/op          0 allocs/op
...
BenchmarkArrOfStructs_16B_1_000-8                1814610               659.8 ns/op             0 B/op          0 allocs/op
BenchmarkArrOfStructs_16B_10_000-8                176470              6687 ns/op               0 B/op          0 allocs/op
BenchmarkArrOfStructs_16B_100_000-8                16658             71902 ns/op               0 B/op          0 allocs/op
...
BenchmarkArrOfStructs_256B_1_000-8                890492              1366 ns/op               0 B/op          0 allocs/op
BenchmarkArrOfStructs_256B_10_000-8                38503             30776 ns/op               0 B/op          0 allocs/op
BenchmarkArrOfStructs_256B_100_000-8                1138           1095108 ns/op               0 B/op          0 allocs/op

CI:

cpu: AMD EPYC 7763 64-Core Processor                
BenchmarkArche_16B_1_000-4              	  398088	      2922 ns/op	       0 B/op	       0 allocs/op
BenchmarkArche_16B_10_000-4             	   41156	     28951 ns/op	       0 B/op	       0 allocs/op
BenchmarkArche_16B_100_000-4            	    4070	    291280 ns/op	       0 B/op	       0 allocs/op
...
BenchmarkArche_256B_1_000-4             	  409652	      2848 ns/op	       0 B/op	       0 allocs/op
BenchmarkArche_256B_10_000-4            	   41116	     29150 ns/op	       0 B/op	       0 allocs/op
BenchmarkArche_256B_100_000-4           	    4096	    292116 ns/op	       0 B/op	       0 allocs/op
...
BenchmarkArrOfStructs_16B_1_000-4       	 1913967	       625.7 ns/op	       0 B/op	       0 allocs/op
BenchmarkArrOfStructs_16B_10_000-4      	  193038	      6250 ns/op	       0 B/op	       0 allocs/op
BenchmarkArrOfStructs_16B_100_000-4     	   19146	     62600 ns/op	       0 B/op	       0 allocs/op
...
BenchmarkArrOfStructs_256B_1_000-4      	 1871191	       640.7 ns/op	       0 B/op	       0 allocs/op
BenchmarkArrOfStructs_256B_10_000-4     	   70363	     16944 ns/op	       0 B/op	       0 allocs/op
BenchmarkArrOfStructs_256B_100_000-4    	    5292	    240630 ns/op	       0 B/op	       0 allocs/op
