Align destination in x86_64's mem* instructions. #474

Demindiro · 2022-07-03T10:31:27Z

While misaligned reads are generally fast, misaligned writes aren't and can have severe penalties.

I don't know if LLVM 9 is still supported. I've used Intel syntax anyways since it's more readable IMO.

Benchmark results on a Ryzen 2700X:

master

test memset_rust_1048576              ... bench:      18,450 ns/iter (+/- 7,933) = 56833 MB/s
test memset_rust_1048576_offset       ... bench:      36,641 ns/iter (+/- 4,096) = 28617 MB/s
test memset_rust_4096                 ... bench:          93 ns/iter (+/- 3) = 44043 MB/s
test memset_rust_4096_offset          ... bench:         144 ns/iter (+/- 4) = 28444 MB/s

test memcpy_rust_1048576              ... bench:      56,233 ns/iter (+/- 8,037) = 18646 MB/s
test memcpy_rust_1048576_misalign     ... bench:      63,518 ns/iter (+/- 3,180) = 16508 MB/s
test memcpy_rust_1048576_offset       ... bench:      57,239 ns/iter (+/- 9,745) = 18319 MB/s
test memcpy_rust_4096                 ... bench:          87 ns/iter (+/- 12) = 47080 MB/s
test memcpy_rust_4096_misalign        ... bench:         161 ns/iter (+/- 10) = 25440 MB/s
test memcpy_rust_4096_offset          ... bench:         164 ns/iter (+/- 12) = 24975 MB/s

test memmove_rust_1048576             ... bench:      53,519 ns/iter (+/- 5,885) = 19592 MB/s
test memmove_rust_1048576_misalign    ... bench:     102,426 ns/iter (+/- 3,702) = 10237 MB/s
test memmove_rust_4096                ... bench:         234 ns/iter (+/- 31) = 17504 MB/s
test memmove_rust_4096_misalign       ... bench:         425 ns/iter (+/- 5) = 9637 MB/s

x86_64-mem-align-dest

test memset_rust_1048576              ... bench:      18,862 ns/iter (+/- 4,941) = 55591 MB/s
test memset_rust_1048576_offset       ... bench:      19,768 ns/iter (+/- 4,017) = 53044 MB/s
test memset_rust_4096                 ... bench:          93 ns/iter (+/- 8) = 44043 MB/s
test memset_rust_4096_offset          ... bench:         106 ns/iter (+/- 12) = 38641 MB/s

test memcpy_rust_1048576              ... bench:      54,790 ns/iter (+/- 5,362) = 19138 MB/s
test memcpy_rust_1048576_misalign     ... bench:      58,650 ns/iter (+/- 5,727) = 17878 MB/s
test memcpy_rust_1048576_offset       ... bench:      55,283 ns/iter (+/- 7,114) = 18967 MB/s
test memcpy_rust_4096                 ... bench:          87 ns/iter (+/- 23) = 47080 MB/s
test memcpy_rust_4096_misalign        ... bench:          99 ns/iter (+/- 20) = 41373 MB/s
test memcpy_rust_4096_offset          ... bench:         106 ns/iter (+/- 11) = 38641 MB/s

test memmove_rust_1048576             ... bench:      52,964 ns/iter (+/- 3,015) = 19797 MB/s
test memmove_rust_1048576_misalign    ... bench:      52,006 ns/iter (+/- 12,560) = 20162 MB/s
test memmove_rust_4096                ... bench:         223 ns/iter (+/- 12) = 18367 MB/s
test memmove_rust_4096_misalign       ... bench:         231 ns/iter (+/- 25) = 17731 MB/s

builtin

test memset_builtin_1048576           ... bench:      19,972 ns/iter (+/- 5,495) = 52502 MB/s
test memset_builtin_1048576_offset    ... bench:      16,349 ns/iter (+/- 6,467) = 64137 MB/s
test memset_builtin_4096              ... bench:          67 ns/iter (+/- 14) = 61134 MB/s
test memset_builtin_4096_offset       ... bench:          68 ns/iter (+/- 19) = 60235 MB/s

test memcpy_builtin_1048576           ... bench:      22,621 ns/iter (+/- 186) = 46354 MB/s
test memcpy_builtin_1048576_misalign  ... bench:      31,836 ns/iter (+/- 6,460) = 32936 MB/s
test memcpy_builtin_1048576_offset    ... bench:      28,163 ns/iter (+/- 5,183) = 37232 MB/s
test memcpy_builtin_4096              ... bench:          68 ns/iter (+/- 10) = 60235 MB/s
test memcpy_builtin_4096_misalign     ... bench:          69 ns/iter (+/- 1) = 59362 MB/s
test memcpy_builtin_4096_offset       ... bench:          69 ns/iter (+/- 13) = 59362 MB/s

test memmove_builtin_1048576          ... bench:      28,480 ns/iter (+/- 7,325) = 36817 MB/s
test memmove_builtin_1048576_misalign ... bench:      34,861 ns/iter (+/- 3,024) = 30078 MB/s
test memmove_builtin_4096             ... bench:          66 ns/iter (+/- 10) = 62060 MB/s
test memmove_builtin_4096_misalign    ... bench:          72 ns/iter (+/- 15) = 56888 MB/s

Demindiro · 2022-07-03T10:38:33Z

The PowerPC test failing is odd, I don't think it should be affected by any changes in this PR?

Demindiro · 2022-07-03T12:33:20Z

I can't reproduce the PowerPC64 failure. I think it's a bug in LLVM that may have been fixed by now.

I run the tests like so:

CC_powerpc64_unknown_linux_gnu=powerpc64-linux-gnu-gcc CARGO_TARGET_POWERPC64_UNKNOWN_LINUX_GNU_LINKER=powerpc64-linux-gnu-gcc CARGO_TARGET_POWERPC64_UNKNOWN_LINUX_GNU_RUNNER=qemu-ppc64-static QEMU_LD_PREFIX=/usr/powerpc64-linux-gnu RUST_TEST_THREADS=1 cargo t --target powerpc64-unknown-linux-gnu

The versions of rustc, gcc and QEMU are:

rustc 1.64.0-nightly (2f3ddd9f5 2022-06-27)
powerpc64-linux-gnu-gcc (Debian 10.2.1-6) 10.2.1 20210110
qemu-ppc64 version 5.2.0 (Debian 1:5.2+dfsg-11+deb11u1)

EDIT: updated to rustc 1.64.0-nightly (f2d93935f 2022-07-02), still no luck even though it's the same as what the CI uses.

bjorn3 · 2022-07-03T14:52:38Z

As far as I understand it the current implementation is actually faster for memcpy and memset on Intel. For backwards memmove using unaligned bytes is indeed slower and as such we already use 8 byte aligned operations. See https://docs.google.com/spreadsheets/d/1H-ubR-xCJWomUYDI9D2JH19BNUD7R9kfkl_OHSv6vMk/edit, which is linked from #365.

Demindiro · 2022-07-03T15:05:17Z

For backwards memmove using unaligned bytes is indeed slower and as such we already use 8 byte aligned operations

copy_backwards does not align to 8 bytes before executing rep movsq.

From what I understand recent Intel and Zen3 processors have the ERMSB feature which makes using only rep movsb fast. This is already special-cased if target_feature = "ermsb" is set.

I'll see if I can do a benchmark on an Intel processor which does not have the ERMSB feature.

Demindiro · 2022-07-03T19:25:19Z

I don't seem to have any Intel CPU without ERMSB, but I figured I'd do a benchmark on one anyways.

Benchmark results for a i5-5287U (Macbook with MacOS):

master

test memset_rust_1048576              ... bench:      35,380 ns/iter (+/- 5,678) = 29637 MB/s
test memset_rust_1048576_offset       ... bench:      34,749 ns/iter (+/- 938) = 30175 MB/s
test memset_rust_4096                 ... bench:          66 ns/iter (+/- 8) = 62060 MB/s
test memset_rust_4096_offset          ... bench:         109 ns/iter (+/- 13) = 37577 MB/s

test memcpy_rust_1048576              ... bench:      68,431 ns/iter (+/- 42,373) = 15323 MB/s
test memcpy_rust_1048576_misalign     ... bench:      99,567 ns/iter (+/- 35,795) = 10531 MB/s
test memcpy_rust_1048576_offset       ... bench:      71,338 ns/iter (+/- 50,645) = 14698 MB/s
test memcpy_rust_4096                 ... bench:          73 ns/iter (+/- 1) = 56109 MB/s
test memcpy_rust_4096_misalign        ... bench:         109 ns/iter (+/- 12) = 37577 MB/s
test memcpy_rust_4096_offset          ... bench:         126 ns/iter (+/- 186) = 32507 MB/s

test memmove_rust_1048576             ... bench:      78,784 ns/iter (+/- 8,777) = 13309 MB/s
test memmove_rust_1048576_misalign    ... bench:     168,847 ns/iter (+/- 8,877) = 6210 MB/s
test memmove_rust_4096                ... bench:         209 ns/iter (+/- 7) = 19598 MB/s
test memmove_rust_4096_misalign       ... bench:         250 ns/iter (+/- 45) = 16384 MB/s

x86_64-mem-align-dest

test memset_rust_1048576              ... bench:      35,106 ns/iter (+/- 3,536) = 29868 MB/s
test memset_rust_1048576_offset       ... bench:      35,186 ns/iter (+/- 997) = 29800 MB/s
test memset_rust_4096                 ... bench:          55 ns/iter (+/- 21) = 74472 MB/s
test memset_rust_4096_offset          ... bench:         106 ns/iter (+/- 3) = 38641 MB/s

test memcpy_rust_1048576              ... bench:      62,781 ns/iter (+/- 7,814) = 16702 MB/s
test memcpy_rust_1048576_misalign     ... bench:      83,916 ns/iter (+/- 4,891) = 12495 MB/s
test memcpy_rust_1048576_offset       ... bench:      64,644 ns/iter (+/- 8,535) = 16220 MB/s
test memcpy_rust_4096                 ... bench:          62 ns/iter (+/- 5) = 66064 MB/s
test memcpy_rust_4096_misalign        ... bench:         117 ns/iter (+/- 2) = 35008 MB/s
test memcpy_rust_4096_offset          ... bench:         122 ns/iter (+/- 20) = 33573 MB/s

test memmove_rust_1048576             ... bench:      78,933 ns/iter (+/- 30,644) = 13284 MB/s
test memmove_rust_1048576_misalign    ... bench:      82,033 ns/iter (+/- 15,301) = 12782 MB/s
test memmove_rust_4096                ... bench:         225 ns/iter (+/- 33) = 18204 MB/s
test memmove_rust_4096_misalign       ... bench:         245 ns/iter (+/- 50) = 16718 MB/s

builtin

test memset_builtin_1048576           ... bench:      34,678 ns/iter (+/- 1,803) = 30237 MB/s
test memset_builtin_1048576_offset    ... bench:      34,760 ns/iter (+/- 3,794) = 30166 MB/s
test memset_builtin_4096              ... bench:          44 ns/iter (+/- 11) = 93090 MB/s
test memset_builtin_4096_offset       ... bench:          44 ns/iter (+/- 12) = 93090 MB/s

test memcpy_builtin_1048576           ... bench:      61,564 ns/iter (+/- 8,961) = 17032 MB/s
test memcpy_builtin_1048576_misalign  ... bench:      76,861 ns/iter (+/- 32,761) = 13642 MB/s
test memcpy_builtin_1048576_offset    ... bench:      64,832 ns/iter (+/- 12,701) = 16173 MB/s
test memcpy_builtin_4096              ... bench:          48 ns/iter (+/- 2) = 85333 MB/s
test memcpy_builtin_4096_misalign     ... bench:          63 ns/iter (+/- 15) = 65015 MB/s
test memcpy_builtin_4096_offset       ... bench:          48 ns/iter (+/- 1) = 85333 MB/s

test memmove_builtin_1048576          ... bench:      70,972 ns/iter (+/- 9,459) = 14774 MB/s
test memmove_builtin_1048576_misalign ... bench:      74,613 ns/iter (+/- 5,180) = 14053 MB/s
test memmove_builtin_4096             ... bench:          45 ns/iter (+/- 3) = 91022 MB/s
test memmove_builtin_4096_misalign    ... bench:          58 ns/iter (+/- 6) = 70620 MB/s

-C target-cpu=native (uses ERMSB version)

test memset_rust_1048576              ... bench:      34,632 ns/iter (+/- 969) = 30277 MB/s
test memset_rust_1048576_offset       ... bench:      34,718 ns/iter (+/- 2,995) = 30202 MB/s
test memset_rust_4096                 ... bench:          44 ns/iter (+/- 3) = 93090 MB/s
test memset_rust_4096_offset          ... bench:          82 ns/iter (+/- 11) = 49951 MB/s

test memcpy_rust_1048576              ... bench:      66,104 ns/iter (+/- 8,826) = 15862 MB/s
test memcpy_rust_1048576_misalign     ... bench:     100,270 ns/iter (+/- 67,361) = 10457 MB/s
test memcpy_rust_1048576_offset       ... bench:      70,141 ns/iter (+/- 10,951) = 14949 MB/s
test memcpy_rust_4096                 ... bench:          53 ns/iter (+/- 4) = 77283 MB/s
test memcpy_rust_4096_misalign        ... bench:         100 ns/iter (+/- 36) = 40960 MB/s
test memcpy_rust_4096_offset          ... bench:          91 ns/iter (+/- 9) = 45010 MB/s

test memmove_rust_1048576             ... bench:      79,976 ns/iter (+/- 36,904) = 13111 MB/s
test memmove_rust_1048576_misalign    ... bench:      92,412 ns/iter (+/- 34,262) = 11346 MB/s
test memmove_rust_4096                ... bench:         227 ns/iter (+/- 25) = 18044 MB/s
test memmove_rust_4096_misalign       ... bench:         245 ns/iter (+/- 71) = 16718 MB/s

It seems only memmove_rust_1048576_misalign benefits, all other cases seem to be as fast as before this change. It is also clear that rep movsb with ERMSB is indeed faster for memcpy. Sadly Zen 1 CPUs do not have it.

Overall, aligning the destination beforehand seems to have no measurable negative impact.

josephlr

Overall LGTM. However, should we add some unit tests for rep_param and rep_param_rev

I would just want sanity checks that:
pre_byte_count + 8*qword_count + byte_count == count

josephlr · 2022-07-07T10:31:05Z

src/mem/x86_64.rs

+            "rep stosb",
+            inout("ecx") pre_byte_count => _,
+            inout("rdi") dest => dest,
+            in("al") c,


Instead of passing in different values for al and rax we can just pull the multiplication to the top and pass the same rax value for each asm block.

See: https://rust.godbolt.org/z/9hrv8eq1G

This seems to make it easier for the compiler to combine these blocks.

josephlr · 2022-07-07T10:35:04Z

src/mem/x86_64.rs

+pub unsafe fn copy_forward(mut dest: *mut u8, mut src: *const u8, count: usize) {
+    let (pre_byte_count, qword_count, byte_count) = rep_param(dest, count);
+    // Separating the blocks gives the compiler more freedom to reorder instructions.
+    // It also allows us to trivially skip the rep movsb, which is faster when memcpying
+    // aligned data.
+    if pre_byte_count > 0 {
+        asm!(
+            "rep movsb",
+            inout("ecx") pre_byte_count => _,
+            inout("rdi") dest => dest,
+            inout("rsi") src => src,
+            options(nostack, preserves_flags)
+        );
+    }
+    asm!(
+        "rep movsq",
        inout("rcx") qword_count => _,
-        inout("rdi") dest => _,
-        inout("rsi") src => _,
-        options(att_syntax, nostack, preserves_flags)
+        inout("rdi") dest => dest,
+        inout("rsi") src => src,
+        options(nostack, preserves_flags)
    );
+    if byte_count > 0 {
+        asm!(
+            "rep movsb",
+            inout("ecx") byte_count => _,
+            inout("rdi") dest => _,
+            inout("rsi") src => _,
+            options(nostack, preserves_flags)
+        );
+    }


Assembly looks reasonable here: https://rust.godbolt.org/z/Ejd8Kv6rb

josephlr · 2022-07-07T10:42:19Z

src/mem/x86_64.rs

+/// Determine optimal parameters for a `rep` instruction.
+fn rep_param(dest: *mut u8, mut count: usize) -> (usize, usize, usize) {
+    // Unaligned writes are still slow on modern processors, so align the destination address.
+    let pre_byte_count = ((8 - (dest as usize & 0b111)) & 0b111).min(count);
+    count -= pre_byte_count;
+    let qword_count = count >> 3;
+    let byte_count = count & 0b111;
+    (pre_byte_count, qword_count, byte_count)
+}
+
+/// Determine optimal parameters for a reverse `rep` instruction (i.e. direction bit is set).
+fn rep_param_rev(dest: *mut u8, mut count: usize) -> (usize, usize, usize) {
+    // Unaligned writes are still slow on modern processors, so align the destination address.
+    let pre_byte_count = ((dest as usize + count) & 0b111).min(count);
+    count -= pre_byte_count;
+    let qword_count = count >> 3;
+    let byte_count = count & 0b111;
+    (pre_byte_count, qword_count, byte_count)
+}


I'm confused why we need two of these functions, can't we just have one rep_param function and just use the output in reverse order for copy_backward?

If we did that, it might be worth calling these: before_byte_count, qword_count, after_byte_count

I actually just didn't think of that 😅

josephlr · 2022-07-07T10:43:11Z

src/mem/x86_64.rs

+    let (pre_byte_count, qword_count, byte_count) = rep_param_rev(dest, count);
+    // We can't separate this block due to std/cld
+    asm!(
        "std",
-        "repe movsq (%rsi), (%rdi)",
-        "movl {byte_count:e}, %ecx",
-        "addq $7, %rdi",
-        "addq $7, %rsi",
-        "repe movsb (%rsi), (%rdi)",
+        "rep movsb",
+        "sub rsi, 7",
+        "sub rdi, 7",
+        "mov rcx, {qword_count}",
+        "rep movsq",
+        "add rsi, 7",
+        "add rdi, 7",
+        "mov ecx, {byte_count:e}",
+        "rep movsb",
        "cld",
        byte_count = in(reg) byte_count,
-        inout("rcx") qword_count => _,
-        inout("rdi") dest.add(count).wrapping_sub(8) => _,
-        inout("rsi") src.add(count).wrapping_sub(8) => _,
-        options(att_syntax, nostack)
+        qword_count = in(reg) qword_count,
+        inout("ecx") pre_byte_count => _,
+        inout("rdi") dest.add(count - 1) => _,
+        inout("rsi") src.add(count - 1) => _,
+        // We modify flags, but we restore it afterwards
+        options(nostack, preserves_flags)


ASM looks reasonable: https://rust.godbolt.org/z/EaGe1vM5b

josephlr · 2022-07-07T10:44:42Z

src/mem/x86_64.rs

-        "addq $7, %rdi",
-        "addq $7, %rsi",
-        "repe movsb (%rsi), (%rdi)",
+        "rep movsb",


Do we want to skip the rep movsb if the count is zero (like we do for the other functions?

josephlr · 2022-07-07T10:49:22Z

I don't know if LLVM 9 is still supported. I've used Intel syntax anyways since it's more readable IMO.

rust-lang/rust#83387 make the minimum LLVM version 10, so I think this is fine.

Can we have the switch to Intel Syntax be in a standalone Commit/PR? We should switch all the ASM to intel style, rather that just the functions in this PR.

Demindiro · 2022-07-07T11:47:31Z

However, should we add some unit tests for rep_param and rep_param_rev

Where should I put them? Putting them directly in src/mem/x86_64.rs doesn't work (testcrate doesn't run them and running cargo t from the root causes an error). Adding them to the testcrate would work but would also be a bit ugly IMO since you'd need a bunch of cfg(target_arch = "x86_64") + make rep_param public (if tests are enabled).

Amanieu · 2022-07-28T16:20:31Z

src/mem/x86_64.rs

+    // Separating the blocks gives the compiler more freedom to reorder instructions.
+    // It also allows us to trivially skip the rep stosb, which is faster when memcpying
+    // aligned data.
+    if pre_byte_count > 0 {


I would expect a rep stosb with an ecx value of 0 to act like a no-op anyways. Is there a perf benefit to keeping the if here?

There is a measurable benefit for memcpy_rust_4096 on my machine at least:

with branch:

test memcpy_rust_1048576 ... bench: 53,173 ns/iter (+/- 644) = 19720 MB/s test memcpy_rust_1048576_misalign ... bench: 58,352 ns/iter (+/- 5,939) = 17969 MB/s test memcpy_rust_1048576_offset ... bench: 52,561 ns/iter (+/- 1,950) = 19949 MB/s test memcpy_rust_4096 ... bench: 84 ns/iter (+/- 20) = 48761 MB/s test memcpy_rust_4096_misalign ... bench: 96 ns/iter (+/- 2) = 42666 MB/s test memcpy_rust_4096_offset ... bench: 97 ns/iter (+/- 0) = 42226 MB/s

without branch:

test memcpy_rust_1048576 ... bench: 55,051 ns/iter (+/- 4,696) = 19047 MB/s test memcpy_rust_1048576_misalign ... bench: 57,791 ns/iter (+/- 545) = 18144 MB/s test memcpy_rust_1048576_offset ... bench: 53,902 ns/iter (+/- 1,893) = 19453 MB/s test memcpy_rust_4096 ... bench: 89 ns/iter (+/- 0) = 46022 MB/s test memcpy_rust_4096_misalign ... bench: 97 ns/iter (+/- 1) = 42226 MB/s test memcpy_rust_4096_offset ... bench: 97 ns/iter (+/- 0) = 42226 MB/s

(Ditto for memset)

It probably makes more sense to leave it out though.

Amanieu

LGTM!

The PowerPC failure is fixed in CI, you can rebase onto the latest master.

While misaligned reads are generally fast, misaligned writes aren't and can have severe penalties.

There is currently no measureable performance difference in benchmarks but it likely will make a difference in real workloads.

While it is measurably faster for older CPUs, removing them keeps the code smaller and is likely more beneficial for newer CPUs.

Demindiro force-pushed the x86_64-mem-align-dest branch from 7b1d90f to db0ca0c Compare July 3, 2022 12:26

josephlr reviewed Jul 7, 2022

View reviewed changes

Amanieu mentioned this pull request Jul 28, 2022

add weak linkage to the ARM AEABI division functions #478

Merged

Amanieu reviewed Jul 28, 2022

View reviewed changes

Demindiro added 5 commits July 28, 2022 18:32

Align destination in mem* instructions.

c30322a

While misaligned reads are generally fast, misaligned writes aren't and can have severe penalties.

Fix suboptimal codegen in memset

314354d

Remove rep_param_rev

a1dd5a8

Use att_syntax for now

387f83e

Skip rep movsb in copy_backward if possible

ae557bd

There is currently no measureable performance difference in benchmarks but it likely will make a difference in real workloads.

Demindiro force-pushed the x86_64-mem-align-dest branch from f03ed0e to ef37a23 Compare July 28, 2022 16:45

Remove branches around rep movsb/stosb

ef37a23

While it is measurably faster for older CPUs, removing them keeps the code smaller and is likely more beneficial for newer CPUs.

Amanieu merged commit 9dfe467 into rust-lang:master Jul 28, 2022

Demindiro deleted the x86_64-mem-align-dest branch July 28, 2022 20:52

Align destination in x86_64's mem* instructions. #474

Align destination in x86_64's mem* instructions. #474

Uh oh!

Conversation

Demindiro commented Jul 3, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Demindiro commented Jul 3, 2022

Uh oh!

Demindiro commented Jul 3, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bjorn3 commented Jul 3, 2022

Uh oh!

Demindiro commented Jul 3, 2022

Uh oh!

Demindiro commented Jul 3, 2022

Uh oh!

josephlr left a comment

Choose a reason for hiding this comment

Uh oh!

josephlr Jul 7, 2022

Choose a reason for hiding this comment

Uh oh!

josephlr Jul 7, 2022

Choose a reason for hiding this comment

Uh oh!

josephlr Jul 7, 2022

Choose a reason for hiding this comment

Uh oh!

Demindiro Jul 7, 2022

Choose a reason for hiding this comment

Uh oh!

josephlr Jul 7, 2022

Choose a reason for hiding this comment

Uh oh!

josephlr Jul 7, 2022

Choose a reason for hiding this comment

Uh oh!

josephlr commented Jul 7, 2022

Uh oh!

Demindiro commented Jul 7, 2022

Uh oh!

Amanieu Jul 28, 2022

Choose a reason for hiding this comment

Uh oh!

Demindiro Jul 28, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Amanieu left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Demindiro commented Jul 3, 2022 •

edited

Loading

Demindiro commented Jul 3, 2022 •

edited

Loading

Demindiro Jul 28, 2022 •

edited

Loading