Huge difference in results using different compiler (gcc) versions. #7

Closed
dvl36 opened this issue Mar 26, 2016 · 9 comments

dvl36 commented Mar 26, 2016

Orange Pi PC board (Allwinner H3, 4x Cortex-A7; Debian Wheezy, loboris image).
Using gcc (Debian 4.6.3-14) 4.6.3:

 C copy backwards                                     :    258.6 MB/s
 C copy                                               :   1011.9 MB/s
 C copy prefetched (32 bytes step)                    :   1035.8 MB/s (0.5%)
 C copy prefetched (64 bytes step)                    :   1025.2 MB/s
 C 2-pass copy                                        :    816.1 MB/s (0.3%)
 C 2-pass copy prefetched (32 bytes step)             :    871.9 MB/s
 C 2-pass copy prefetched (64 bytes step)             :    872.9 MB/s (0.4%)
 C fill                                               :   3960.6 MB/s (0.4%)
 ---
 standard memcpy                                      :   1098.0 MB/s (0.3%)
 standard memset                                      :   3568.1 MB/s
 ---

Using gcc (Debian 4.7.2-5) 4.7.2:

 C copy backwards                                     :   1069.7 MB/s
 C copy                                               :    287.4 MB/s (0.3%)
 C copy prefetched (32 bytes step)                    :    300.1 MB/s
 C copy prefetched (64 bytes step)                    :    300.0 MB/s (0.2%)
 C 2-pass copy                                        :    242.2 MB/s (0.1%)
 C 2-pass copy prefetched (32 bytes step)             :    248.9 MB/s (0.4%)
 C 2-pass copy prefetched (64 bytes step)             :    245.9 MB/s
 C fill                                               :   3988.2 MB/s
 ---
 standard memcpy                                      :   1133.7 MB/s
 standard memset                                      :   3620.2 MB/s
 ---
ssvb added the question label Mar 26, 2016
ssvb commented Mar 26, 2016

We really have no control over the code that is generated by GCC or any other C compiler. The C implementations are only provided here to compare them against the memcpy/memset from glibc and also against the assembly implementations. The assembly implementations should be rather deterministic.

In your case we see that the memset implementation from glibc is not optimal for this hardware because even the C implementation is faster. And the performance of the generic C copy code is really unstable. If you are curious, you can try to have a look at the objdump logs and find the aligned_block_copy, aligned_block_copy_backwards, aligned_block_copy_pf32 and aligned_block_copy_pf64 functions.
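For reference, the C copy loop in question is roughly shaped like this (a simplified sketch of the idea; see the actual tinymembench source for the exact code):

#include <stdint.h>

/* Sketch of the aligned_block_copy() idea: copy 64 bytes per iteration
 * as eight 64-bit loads/stores. Note that nothing here stops the
 * compiler from reordering the individual memory accesses. */
void aligned_block_copy_sketch(int64_t *dst, int64_t *src, int size)
{
    while ((size -= 64) >= 0)
    {
        dst[0] = src[0];
        dst[1] = src[1];
        dst[2] = src[2];
        dst[3] = src[3];
        dst[4] = src[4];
        dst[5] = src[5];
        dst[6] = src[6];
        dst[7] = src[7];
        dst += 8;
        src += 8;
    }
}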

There is a good reason why glibc normally uses assembly implementations for memcpy/memset on all platforms: modern compilers are simply not good enough.

ssvb commented Mar 26, 2016

Also, the tinymembench program is not so much a benchmark as a tool for detecting memory-related performance abnormalities. That's why it is a collection of different implementations of memory copy and fill operations. Ideally they should all be equally fast, but in reality this is not always the case.

dvl36 commented Mar 26, 2016

I have tried replacing the alloc_four_nonaliased_buffers() call with malloc()/memset() calls for the src, dst and tmp buffers.
The results are much worse, but the differences aren't so huge.
So it seems like something is wrong with the "handcrafted" alignments in alloc_four_nonaliased_buffers().
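For reference, the replacement was along these lines (a sketch, not the exact change):

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Naive setup: plain malloc() gives no control over the relative
 * alignment of the buffers, so src and dst may land in the same cache
 * sets; memset() just faults the pages in up front. */
static void alloc_buffers_naive(int64_t **src, int64_t **dst,
                                int64_t **tmp, size_t size)
{
    *src = malloc(size);
    *dst = malloc(size);
    *tmp = malloc(size);
    memset(*src, 0xCC, size);
    memset(*dst, 0xCC, size);
    memset(*tmp, 0xCC, size);
}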
gcc (Debian 4.6.3-14) 4.6.3:

 C copy backwards                                     :    245.1 MB/s
 C copy                                               :    766.1 MB/s
 C copy prefetched (32 bytes step)                    :    311.7 MB/s
 C copy prefetched (64 bytes step)                    :    311.7 MB/s
 C 2-pass copy                                        :    953.2 MB/s
 C 2-pass copy prefetched (32 bytes step)             :    349.4 MB/s (0.2%)
 C 2-pass copy prefetched (64 bytes step)             :    349.5 MB/s
 C fill                                               :   3953.0 MB/s
 ---
 standard memcpy                                      :    636.3 MB/s
 standard memset                                      :   3564.8 MB/s
 ---

gcc (Debian 4.7.2-5) 4.7.2:

 C copy backwards                                     :    343.8 MB/s
 C copy                                               :    794.3 MB/s (0.1%)
 C copy prefetched (32 bytes step)                    :    839.0 MB/s (0.3%)
 C copy prefetched (64 bytes step)                    :    844.6 MB/s (0.4%)
 C 2-pass copy                                        :    611.1 MB/s
 C 2-pass copy prefetched (32 bytes step)             :    622.3 MB/s
 C 2-pass copy prefetched (64 bytes step)             :    622.3 MB/s
 C fill                                               :   4025.8 MB/s
 ---
 standard memcpy                                      :    642.1 MB/s
 standard memset                                      :   3628.6 MB/s
 ---

ssvb commented Mar 26, 2016

The results are much worse, but the differences aren't so huge.

You still get a huge difference:

  • Before: 1011.9 MB/s vs. 258.6 MB/s
  • After: 766.1 MB/s vs. 245.1 MB/s

It looks like the C compiler generates very bad code either for the backwards copy (gcc 4.6.3) or for the forward copy (gcc 4.7.2). As I said, it's best to look at the generated assembly.

So it seems like something is wrong with the "handcrafted" alignments in alloc_four_nonaliased_buffers().

This handcrafted alignment is picked in such a way that we maximize the number of differing address bits between the source and destination pointers when doing a memory copy operation. This is done in order to avoid fighting for the same cache set and evicting the freshly prefetched source data by doing writes to the destination buffer. Here is a pretty good explanation of set-associative caches: https://www.cs.umd.edu/class/sum2003/cmsc311/Notes/Memory/set.html
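The idea can be sketched as follows (illustrative only, with made-up offsets; see alloc_four_nonaliased_buffers() in the tinymembench source for the real logic):

#include <stdint.h>
#include <stdlib.h>

/* Illustrative sketch: carve two buffers out of one big allocation and
 * give each a different sub-page offset, so that the source and
 * destination addresses differ in as many index bits as possible and
 * do not fight for the same sets of a set-associative cache. */
static void alloc_two_nonaliased(char **src, char **dst, size_t size)
{
    char *mem = malloc(2 * size + 4 * 4096);
    uintptr_t p = ((uintptr_t)mem + 4095) & ~(uintptr_t)4095;
    *src = (char *)p;                 /* offset 0 within its page    */
    *dst = (char *)(p + size + 2048); /* offset 2048 within its page */
}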

Some additional information:

  • Memory read performance heavily depends on the CPU cache. DRAM has a huge latency, so in order to do fast sequential reads, we need to prefetch data into the cache at some distance ahead. There is software prefetch, where prefetching is done using special instructions (see the sketch after this list). And there is also automatic hardware prefetch, where the CPU is able to track sequential accesses and automatically do prefetching under the hood.
  • Cache line replacement inside each cache set is random. It means that no matter how many cache ways we have (the L1 data cache is 4-way associative in Cortex-A7), there is always a chance of getting a useful cache line evicted just by accessing the same set (via a write to the destination buffer). With an LRU replacement policy there would be no such problem.
  • The cache is physically tagged. It means that a buffer at exactly the same virtual address in the process address space (for example, a static array) can be backed by arbitrarily fragmented pages at arbitrary locations in physical memory. And the CPU cache uses physical addresses, not virtual ones. So the performance results may vary wildly across different runs of the same application because the actual layout of physical memory pages may differ.
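As an illustration of software prefetch, here is a minimal sketch (my own example, not tinymembench code); __builtin_prefetch is a GCC builtin that maps to the PLD instruction on ARM:

#include <stdint.h>

#define PF_DIST 64 /* prefetch distance in bytes, an arbitrary example */

/* Copy loop with explicit software prefetch: request the data PF_DIST
 * bytes ahead of the current read position so that it is already in
 * the cache by the time the loop reaches it. */
void copy_with_prefetch(int64_t *dst, const int64_t *src, int size)
{
    while ((size -= 64) >= 0)
    {
        __builtin_prefetch((const char *)src + PF_DIST, 0, 0);
        for (int i = 0; i < 8; i++)
            dst[i] = src[i];
        dst += 8;
        src += 8;
    }
}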

Yes, we are artificially creating the most favourable conditions for getting the maximal possible speed out of the memory subsystem by using the alloc_four_nonaliased_buffers() function for allocating buffers. Real performance in real applications is usually going to be the same or worse in a somewhat unpredictable way.

ssvb commented Mar 27, 2016

It looks like the C compiler generates very bad code either for the backwards copy (gcc 4.6.3) or for the forward copy (gcc 4.7.2). As I said, it's best to look at the generated assembly.

Indeed, confirmed. Here is the code for the main loop of the aligned_block_copy() function generated by the gcc 4.7.3 compiler:

9e8c:       e1c140d8        ldrd    r4, [r1, #8]
9e90:       e2522040        subs    r2, r2, #64     ; 0x40
9e94:       e1c1a2d0        ldrd    sl, [r1, #32]
9e98:       e1c182d8        ldrd    r8, [r1, #40]   ; 0x28
9e9c:       e1c040f8        strd    r4, [r0, #8]
9ea0:       e1c141d0        ldrd    r4, [r1, #16]
9ea4:       e1c163d0        ldrd    r6, [r1, #48]   ; 0x30
9ea8:       e1c0a2f0        strd    sl, [r0, #32]
9eac:       e1c041f0        strd    r4, [r0, #16]
9eb0:       e1c141d8        ldrd    r4, [r1, #24]
9eb4:       e1c082f8        strd    r8, [r0, #40]   ; 0x28
9eb8:       e1c063f0        strd    r6, [r0, #48]   ; 0x30
9ebc:       e1c041f8        strd    r4, [r0, #24]
9ec0:       e1c140d0        ldrd    r4, [r1]
9ec4:       e1c040f0        strd    r4, [r0]
9ec8:       e1c143d8        ldrd    r4, [r1, #56]   ; 0x38
9ecc:       e2811040        add     r1, r1, #64     ; 0x40
9ed0:       e1c043f8        strd    r4, [r0, #56]   ; 0x38
9ed4:       e2800040        add     r0, r0, #64     ; 0x40
9ed8:       5affffeb        bpl     9e8c <aligned_block_copy+0xc>

The memory load and store operations are reordered in a rather arbitrary way compared to the original C source. But if we compile the code with -O1 optimizations, the compiler stops reordering memory accesses and the forward copy performance becomes good. The current backwards copy memory accesses are not optimal for Cortex-A7, so the following patch can be applied and used together with the -O1 option in CFLAGS to make it fast. Instead of the -O1 option, we could also make the pointers volatile and get a similar effect.
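The volatile variant looks roughly like this (an illustration of the idea, not the actual patch):

#include <stdint.h>

/* Same loop as the copy sketch earlier in the thread, but through
 * volatile pointers: the compiler must now perform the loads and
 * stores exactly in source order, at any optimization level. */
void aligned_block_copy_volatile(int64_t *dst_, int64_t *src_, int size)
{
    volatile int64_t *dst = dst_;
    volatile int64_t *src = src_;
    while ((size -= 64) >= 0)
    {
        dst[0] = src[0];
        dst[1] = src[1];
        dst[2] = src[2];
        dst[3] = src[3];
        dst[4] = src[4];
        dst[5] = src[5];
        dst[6] = src[6];
        dst[7] = src[7];
        dst += 8;
        src += 8;
    }
}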

But this is high-level C code after all. The compiler takes the liberty of reordering memory accesses and does not do this job particularly well here. Too bad, but this is not our problem. Any performance critical code should use hand-optimized assembly anyway :)

dvl36 commented Mar 27, 2016

You still get a huge difference:

  • Before: 1011.9 MB/s vs. 258.6 MB/s

I meant the difference between compilers of different versions, not between copy directions. (Sorry, my English isn't good.)

It looks like the C compiler generates very bad code either for the backwards copy (gcc 4.6.3) or for the forward copy (gcc 4.7.2).

Without the handcrafted alignments, both GCC versions generate code that works faster for the forward copy and much worse for the backward copy.

dvl36 commented Mar 27, 2016

The current backwards copy memory accesses are not optimal for Cortex-A7, so the following patch can be applied and used together with the -O1 option in CFLAGS to make it fast.

Yes, with the patch and the -O1 switch, the speed is the same in both directions and only slightly different between gcc versions. Thanks.

gcc (Debian 4.6.3-14) 4.6.3:

 C copy backwards                                     :   1040.5 MB/s
 C copy                                               :   1080.8 MB/s
 C copy prefetched (32 bytes step)                    :   1005.3 MB/s
 C copy prefetched (64 bytes step)                    :   1005.4 MB/s
 C 2-pass copy                                        :    898.6 MB/s
 C 2-pass copy prefetched (32 bytes step)             :    972.1 MB/s
 C 2-pass copy prefetched (64 bytes step)             :    972.1 MB/s
 C fill                                               :   4019.5 MB/s
 ---
 standard memcpy                                      :   1119.2 MB/s
 standard memset                                      :   3631.8 MB/s
 ---

gcc (Debian 4.7.2-5) 4.7.2:

 C copy backwards                                     :   1016.1 MB/s
 C copy                                               :   1064.4 MB/s (0.4%)
 C copy prefetched (32 bytes step)                    :    992.0 MB/s
 C copy prefetched (64 bytes step)                    :   1000.5 MB/s (0.4%)
 C 2-pass copy                                        :    889.7 MB/s
 C 2-pass copy prefetched (32 bytes step)             :    957.1 MB/s
 C 2-pass copy prefetched (64 bytes step)             :    957.5 MB/s
 C fill                                               :   3994.2 MB/s (0.4%)
 ---
 standard memcpy                                      :   1101.5 MB/s (0.2%)
 standard memset                                      :   3568.5 MB/s
 ---

ssvb commented Mar 27, 2016

This is pure speculation, but I suspect that reordering memory accesses so that they jump back and forth, instead of reading/writing sequentially, may sometimes upset the Cortex-A7 processor and make the performance drop. Either the automatic prefetcher becomes confused or maybe some kind of write-combining logic stops working.

dvl36 commented Mar 27, 2016

It seems so. When I looked at the assembly, I didn't realise that this back-and-forth DRAM access pattern could reduce bandwidth so drastically.

dvl36 closed this as completed Mar 28, 2016
ssvb added a commit that referenced this issue Mar 29, 2016
The C compiler may attempt to reorder read and write operations when
accessing the source and destination buffers. So instead of sequential
memory accesses we may get something like a "drunk master style"
memory access pattern. Certain processors, such as ARM Cortex-A7,
do not like such a memory access pattern very much, and it causes
a major performance drop. The actual access pattern is unpredictable
and sensitive to the compiler version, optimization flags, and
sometimes even to changes in unrelated parts of the source code.

So use the volatile keyword for the destination pointer in order
to resolve this problem and make the C benchmarks more deterministic.

See #7
ssvb added a commit that referenced this issue Mar 29, 2016
This is expected to test the ability to do write combining for
scattered writes and detect any possible performance penalties.
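
A sketch of the shuffled fill idea (illustrative; the permutation below is made up, the real test lives in the tinymembench source):

#include <stdint.h>

/* Fill memory 64 bytes at a time, but write the eight 64-bit words of
 * each block in a permuted order instead of sequentially. A CPU whose
 * write combining copes with this still merges the writes into full
 * line bursts; one that does not shows a large penalty, as the
 * Cortex-A7 numbers below do for the 32 and 64 byte shuffles. */
void fill_shuffled_64(int64_t *dst, int64_t value, int size)
{
    static const int order[8] = { 5, 2, 7, 0, 3, 6, 1, 4 };
    while ((size -= 64) >= 0)
    {
        for (int i = 0; i < 8; i++)
            dst[order[i]] = value;
        dst += 8;
    }
}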

Example reports:

== ARM Cortex A7 ==
 C fill                                               :   4011.5 MB/s
 C fill (shuffle within 16 byte blocks)               :   4112.2 MB/s (0.3%)
 C fill (shuffle within 32 byte blocks)               :    333.9 MB/s
 C fill (shuffle within 64 byte blocks)               :    336.6 MB/s

== ARM Cortex A15 ==
 C fill                                               :   6065.2 MB/s (0.4%)
 C fill (shuffle within 16 byte blocks)               :   2152.0 MB/s
 C fill (shuffle within 32 byte blocks)               :   2150.7 MB/s
 C fill (shuffle within 64 byte blocks)               :   2238.2 MB/s

== ARM Cortex A53 ==
 C fill                                               :   3080.8 MB/s (0.2%)
 C fill (shuffle within 16 byte blocks)               :   3080.7 MB/s
 C fill (shuffle within 32 byte blocks)               :   3079.2 MB/s
 C fill (shuffle within 64 byte blocks)               :   3080.4 MB/s

== Intel Atom N450 ==
 C fill                                               :   1554.9 MB/s
 C fill (shuffle within 16 byte blocks)               :   1554.5 MB/s
 C fill (shuffle within 32 byte blocks)               :   1553.9 MB/s
 C fill (shuffle within 64 byte blocks)               :   1554.4 MB/s

See #7