Huge difference in results using different compiler (gcc) versions. #7
We really have no control over the code that is generated by GCC or any other C compiler. The C implementations are only provided here to compare them against the memcpy/memset from glibc and also against the assembly implementations. The assembly implementations should be rather deterministic. In your case we see that the memset implementation from glibc is not optimal for this hardware, because even the C implementation is faster. And the performance of the generic C copy code is really unstable. If you are curious, you can have a look at the objdump logs. There is a good reason why glibc normally uses assembly implementations for memcpy/memset on all platforms: modern compilers are simply not good enough.
Also the
I have tried to replace the alloc_four_nonaliased_buffers() call with malloc()/memset() calls for the src, dst and tmp buffers.
gcc (Debian 4.7.2-5) 4.7.2:
You still get a huge difference:
It looks like the C compiler generates very bad code either for the backwards copy (gcc 4.6.3) or for the forward copy (gcc 4.7.2). As I said, it's best to look at the generated assembly.
This handcrafted alignment is picked in such a way that we maximize the number of different address bits between the source and the destination pointers when doing a memory copy operation. This is done in order to avoid fighting for the same cache set and evicting the freshly prefetched source data by doing writes to the destination buffer. Here is a pretty good explanation about the set-associative cache: https://www.cs.umd.edu/class/sum2003/cmsc311/Notes/Memory/set.html Some additional information:
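To make the idea more concrete, here is a minimal C sketch of such non-aliased buffer placement. The function name, slot sizes and offset step are illustrative assumptions, not the benchmark's actual alloc_four_nonaliased_buffers() implementation:

```c
/* Minimal sketch (assumed names/offsets): all buffers are carved out of one
 * big allocation, and each one is shifted by a different sub-page offset, so
 * that source and destination addresses map to different cache sets and the
 * writes to dst do not evict freshly prefetched src data. */
#include <stdint.h>
#include <stdlib.h>

#define OFFSET_STEP 1024   /* assumed step, a fraction of a 4 KiB page */

static uint8_t *place_buffers(size_t size, uint8_t **src, uint8_t **dst,
                              uint8_t **tmp0, uint8_t **tmp1)
{
    /* round each slot up to a page and leave room for the extra offsets */
    size_t slot = (size + 4095) & ~(size_t)4095;
    uint8_t *base = malloc(4 * slot + 4 * OFFSET_STEP);
    if (!base)
        return NULL;
    *src  = base + 0 * slot + 0 * OFFSET_STEP;
    *dst  = base + 1 * slot + 1 * OFFSET_STEP;
    *tmp0 = base + 2 * slot + 2 * OFFSET_STEP;
    *tmp1 = base + 3 * slot + 3 * OFFSET_STEP;
    return base;   /* caller frees the whole block with free(base) */
}
```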
Yes, we are artificially creating the most favourable conditions for getting the maximal possible speed out of the memory subsystem by using the alloc_four_nonaliased_buffers() function.
Indeed, confirmed. Here is the code for the main loop of the
The memory load and store operations are reordered in a rather arbitrary way compared to the original C source. But if we compile the code with -O1 optimizations, the compiler stops reordering memory accesses and the forward copy performance becomes good. The current backwards copy memory accesses are not optimal for Cortex-A7, so the following patch can be applied, together with the -O1 option in CFLAGS, to make it fast. Instead of the -O1 option, we could also make the pointers volatile and get a similar effect. But this is high level C code after all: the compiler takes the liberty to reorder memory accesses and does not do this job particularly well here. Too bad, but this is not our problem. Any performance critical code should use hand optimized assembly anyway :)
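For illustration, the volatile variant could look roughly like this (a sketch with assumed names and block size, not the actual patch from this issue):

```c
/* Marking the destination pointer volatile forces the compiler to emit the
 * stores in program order, keeping the access pattern sequential instead of
 * letting the optimizer shuffle the loads and stores around. */
#include <stdint.h>

static void copy_forward_inorder(int64_t *dst_, const int64_t *src, int size)
{
    volatile int64_t *dst = dst_;   /* stores can no longer be reordered */
    while ((size -= 64) >= 0) {     /* copy one 64-byte block per iteration */
        dst[0] = src[0];
        dst[1] = src[1];
        dst[2] = src[2];
        dst[3] = src[3];
        dst[4] = src[4];
        dst[5] = src[5];
        dst[6] = src[6];
        dst[7] = src[7];
        dst += 8;
        src += 8;
    }
}
```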
I mean the difference between compiler versions, not between copy directions. (Sorry, my English isn't good.)
Without the handcrafted alignments, both GCC versions generate code that works faster for the forward copy and much worse for the backward copy.
Yes, with the patch and the -O1 switch, the speed is the same in both directions and only slightly different between gcc versions. Thanks.
gcc (Debian 4.6.3-14) 4.6.3:
gcc (Debian 4.7.2-5) 4.7.2:
This is pure speculation, but I suspect that sometimes reordering memory accesses, making them jump back and forth instead of reading/writing sequentially, may upset the Cortex-A7 processor and the performance drops. Either the automatic prefetcher becomes confused or maybe some kind of write-combining logic stops working.
It seems so. When I looked at the assembly, I didn't realise that this back-and-forth DRAM access pattern could reduce bandwidth so drastically.
The C compiler may attempt to reorder read and write operations when accessing the source and destination buffers. So instead of sequential memory accesses we may get something like a "drunk master style" memory access pattern. Certain processors, such as the ARM Cortex-A7, do not like such a memory access pattern very much, and it causes a major performance drop. The actual access pattern is unpredictable and is sensitive to the compiler version, optimization flags and sometimes even to changes in unrelated parts of the source code. So use the volatile keyword for the destination pointer in order to resolve this problem and make the C benchmarks more deterministic. See #7
This is expected to test the ability to do write combining for scattered writes and detect any possible performance penalties.

Example reports:

== ARM Cortex A7 ==
 C fill                                  : 4011.5 MB/s
 C fill (shuffle within 16 byte blocks)  : 4112.2 MB/s (0.3%)
 C fill (shuffle within 32 byte blocks)  :  333.9 MB/s
 C fill (shuffle within 64 byte blocks)  :  336.6 MB/s

== ARM Cortex A15 ==
 C fill                                  : 6065.2 MB/s (0.4%)
 C fill (shuffle within 16 byte blocks)  : 2152.0 MB/s
 C fill (shuffle within 32 byte blocks)  : 2150.7 MB/s
 C fill (shuffle within 64 byte blocks)  : 2238.2 MB/s

== ARM Cortex A53 ==
 C fill                                  : 3080.8 MB/s (0.2%)
 C fill (shuffle within 16 byte blocks)  : 3080.7 MB/s
 C fill (shuffle within 32 byte blocks)  : 3079.2 MB/s
 C fill (shuffle within 64 byte blocks)  : 3080.4 MB/s

== Intel Atom N450 ==
 C fill                                  : 1554.9 MB/s
 C fill (shuffle within 16 byte blocks)  : 1554.5 MB/s
 C fill (shuffle within 32 byte blocks)  : 1553.9 MB/s
 C fill (shuffle within 64 byte blocks)  : 1554.4 MB/s

See #7
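For reference, here is a rough C sketch of what a "shuffle within 64 byte blocks" fill might look like. The function name and the exact permutation are assumptions for illustration, not the benchmark's real code:

```c
/* Every byte of each 64-byte block still gets written, but the eight 64-bit
 * stores inside the block are issued out of sequential order, which exercises
 * the CPU's write-combining logic for scattered writes. */
#include <stdint.h>

static void fill_shuffle64(int64_t *dst, int64_t value, int size)
{
    /* arbitrary permutation of the eight words inside each 64-byte block */
    static const int order[8] = { 5, 0, 7, 2, 4, 1, 6, 3 };
    while ((size -= 64) >= 0) {
        for (int i = 0; i < 8; i++)
            dst[order[i]] = value;
        dst += 8;
    }
}
```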
Orange Pi PC board (Allwinner H3, 4x Cortex-A7), Debian Wheezy (loboris).
Using gcc (Debian 4.6.3-14) 4.6.3:
Using gcc (Debian 4.7.2-5) 4.7.2: