Skip to content

Conversation

@joakim-noah
Copy link
Contributor

x86/x64 has no problem with unaligned reads, but the llvm optimizer gets confused by the cast in the foreach and assumes the Element is 32-bit aligned, generating ARM instructions that require word alignment and causing problems with the last unaligned test.

As for posix_memalign, turns out its declared in druntime but wasn't for OSX, so added it in dlang/druntime#1819 and used the import instead. That druntime PR will need to be merged first.

@CyberShadow
Copy link
Member

As these two changes are in completely unrelated parts of the code, it would be nice to at least do them in separate commits.

@ibuclaw
Copy link
Member

ibuclaw commented May 3, 2017

the llvm optimizer gets confused by the cast in the foreach and assumes the Element is 32-bit aligned,

Alternatively, you could just fix the problem with llvm. ☺️

@joakim-noah
Copy link
Contributor Author

The problem isn't with llvm. The cast takes a possibly unaligned ubyte[] chunk and puts it in a type that is word-aligned by default in D, like uint or ulong. That works fine on CPU arches that always allow unaligned reads, like x86, but not ARM.

If you like, try it with gdc and see if it does any better. The last unaligned tests in the module trip up when optimized by ldc.

@joakim-noah
Copy link
Contributor Author

joakim-noah commented May 3, 2017

Split the changes into two commits and fixed a misnamed temp variable.

@kinke and @gchatelet, review needed.

@ibuclaw
Copy link
Member

ibuclaw commented May 3, 2017

That works fine on CPU arches that always allow unaligned reads, like x86, but not ARM.

Yes, I know. It still seems strange that you are taking the ref const, then removing the ref. Why not just make the loop foreach (const Element; ...).

It's probably not just ARM, affected, so maybe some version (NeedAlignedLoads) might be better, that's a topic for the bikeshed anyhow.

@joakim-noah
Copy link
Contributor Author

It still seems strange that you are taking the ref const, then removing the ref. Why not just make the loop foreach (const Element; ...).

I didn't want to touch anything on x86/x64, which @gchatelet did a lot of work to optimize, in #3916.

It's probably not just ARM, affected, so maybe some version (NeedAlignedLoads) might be better, that's a topic for the bikeshed anyhow.

Sure, he mentioned MIPS also in a comment earlier in the module. I figure it's better for those ports to add it themselves, as needed. If you have some arch in mind, try it with gdc and let me know what needs adding.

{
// Since the block may be unaligned, this can trip up ARM
// codegen, so place it in a newly aligned Element first.
Element alignedTemp = block;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this fix is correct. Evaluating the right hand side (block) is still an unaligned read. The generated code just happens to be able to deal with that. The actual issue is the cast to Element[], which violates alignment guarantees. "Fixing" it like this is just asking for the problem to return later.

A correct fix would be a loop that iterates over data in chunks of the right size and memcpys it to a stack local (e.g. by using slice assignment on the variable casted to void[]).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I'm not familiar with these alignment issues and exactly how D code is likely to trigger them: I just used the simplest fix that worked. I suppose you want something more like @kinke's patch, which uses different types based on the alignment, though maybe he had problems with certain types that are commented out.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm actually surprised this apparently fixed the issue for LDC, it smells like wasted optimization opportunity somewhere. A rather compact proper (I hope, I just typed it blindly) fix with no optimizations for alignments of 4/2 bytes would be:

version(ARM)
{
    // data.ptr appropriately aligned for type Element?
    if ((cast(size_t) data.ptr) & (Element.alignof - 1) == 0)
    {
        foreach (ref const Element block; cast(const(Element[]))(data[0 .. remainderStart]))
            putElement(block);
    }
    else
    {
        Element alignedBlock = void;
        for (size_t i = 0; i != remainderStart; i += Element.sizeof)
        {
            (cast(ubyte*) &alignedBlock)[0 .. Element.sizeof] = data[i .. i + Element.sizeof];
            putElement(alignedBlock);
        }
    }
}
else
{
    foreach (ref const Element block; cast(const(Element[]))(data[0 .. remainderStart]))
        putElement(block);
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I'm not familiar with these alignment issues and exactly how D code is likely to trigger them

Although explicit documentation is scarce (with DMD being so x86-centric, though the same problem occurs with vector instructions there as well), the basic idea is that every pointer of type T* is guaranteed to point to an address aligned to T.alignof. The code generator can of course then assume this when using the pointer in loads/stores. In your code, the cast(Element[]) is still wrong, as .ptr is not aligned afterwards.

Note that assuming alignment for arrays is the only sensible thing to do, otherwise every access to an array with elements larger than a byte would be dog-slow on architectures that don't support unaligned loads, as we would need to emit a byte-by-byte load. (We should add a debug check to array casts to catch this, by the way.)

numChunks = processChunks!(ushort[size / 16])();
else
numChunks = processChunks!(ubyte[size / 8])();
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lightly modified version of @kinke's fix, let me know if this suffices. It passes the extended tests below on ARM.

numChunks = processChunks!(uint)();
/* TODO: cannot cast ushort/ubyte slice to Element = uint
else if ((startAddress & 1) == 0)
numChunks = processChunks!(ushort[2])();
Copy link
Contributor Author

@joakim-noah joakim-noah May 21, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't a ushort[2] be cast to a uint?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still needs to be sorted out, otherwise the 32-bit hasher will still fail for unaligned data.

@joakim-noah joakim-noah changed the title Add ARM alignment fix and use declaration from druntime Add ARM alignment fix, version out some HardFloat tests, and use declaration from druntime May 21, 2017
}

@system unittest
version(D_HardFloat) @system unittest
Copy link
Contributor Author

@joakim-noah joakim-noah May 21, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note this added commit, needed when running the tests on Android/ARM for the merge-2.074 branch of ldc.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs moving to a separate pr IMO.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's so small, I'd rather just do it here. What's the problem with this commit? These three commits are what I need to get the tests passing on Android/ARM, fairly small PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only one of the commits is fine, we can merge it now without waiting on the other two. I'm not sure about this change, but I haven't looked at the body, and I don't wish to discuss here, as the topic is about murmurhash.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The topic is some small changes I had to make for Android/ARM, ie the contents of this PR. The notion that one can't discuss this tiny commit here because of the surrounding discussion about murmurhash3 is laughable, as most non-negligible Phobos PRs have many such topical threads of discussion.

@joakim-noah
Copy link
Contributor Author

ping, made changes asked for.

immutable unalignedHash = digest!H(data[1 .. $]); // 1 .. 1024
assert(alignedHash == unalignedHash);
immutable ubyte[1028] data = 0xAC;
immutable alignedHash = digest!H(data[0 .. 1023]);
Copy link
Contributor

@kinke kinke May 24, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0 .. 1024 (contrary to the misleading previous comment) if you want to preserve the previous behavior (hashing a full 1K chunk); same for 3 lines below.
data isn't guaranteed to be aligned (wrt. 'alignedHash'), its declaration would need an align(Element.alignof) for that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I thought that looked off, was just lazy to check it, corrected the last index now. As for the alignment, it doesn't have to be but it usually is, and the right behavior will be checked regardless, as it's cycling through 5 different byte alignments.

import core.stdc.string : memcpy;
Element block;
()@trusted{memcpy(&block, &chunk, Element.sizeof);}();
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put this in to fix the 32-bit unaligned read you pointed out is still there, @kinke, following David's suggestion. I'm not having any issues with the way it was before or now, ie the test passes, but this won't screw up with some optimizer pass later on.

Copy link
Member

@ibuclaw ibuclaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make the std.math and allocator patches as a separate pr each? Discussing them here would just create noise in the thread.

}

@system unittest
version(D_HardFloat) @system unittest
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs moving to a separate pr IMO.

import core.stdc.string : memcpy;
Element block;
()@trusted{memcpy(&block, &chunk, Element.sizeof);}();
putElement(block);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merged and reorganized the 32-bit alignment fix I added yesterday, @kinke, here's the memcpy workaround David suggested earlier. I benchmarked it on unaligned data using @gchatelet's benchmark, it's usually only 1-2% slower, around 4 GiB/s.

@joakim-noah
Copy link
Contributor Author

@ibuclaw, I'm not moving a two-line patch to a different PR, I see no reason not to discuss all three commits here.

@joakim-noah
Copy link
Contributor Author

joakim-noah commented Jun 21, 2017

Hey Iain, since you did all that work to get gdc into gcc, I'd like to point out that this PR is essentially a gift from ldc to gdc, stemming from issues that we ran into when updating to 2.072 and 2.074, as dmd won't use the two ARM-related commits anyway.

After David approved this PR, I merged the first two commits into ldc's branch of phobos, and when I tried to merge the last one into @kinke's WIP ldc branch for 2.074, I saw that he already stubbed out those tests because of other issues.

Whenever gdc updates to a newer phobos, you will likely need these patches for ARM. If you want me to modify them in some way to better suit gdc, I'll be happy to discuss that. Until then, this PR is only here to benefit gdc, as ldc is already using it.

@PetarKirov
Copy link
Member

(Closing & reopening the PR in order to restart the CI tests.)

@PetarKirov PetarKirov closed this Jun 23, 2017
@PetarKirov PetarKirov reopened this Jun 23, 2017
@dlang-bot
Copy link
Contributor

dlang-bot commented Jun 23, 2017

Thanks for your pull request, @joakim-noah!

Bugzilla references

Your PR doesn't reference any Bugzilla issue.

If your PR contains non-trivial changes, please reference a Bugzilla issue or create a manual changelog.

assert(alignedHash == unalignedHash);
immutable ubyte[1028] data = 0xAC;
immutable alignedHash = digest!H(data[0 .. 1024]);
foreach(i; 1 .. 5)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CirclCI fails because there needs to be a space between foreach and (.

@PetarKirov
Copy link
Member

PetarKirov commented Jun 23, 2017

Sizeof BCValue: 40LU
60 opcodes remaining
/dev/shm/dtest/work/repo/dlang.org/../dmd/src/ddmd/ctfe/bc_skeleton.d(9): Error: undefined identifier BCFunction
SDCs LLVM Header are not avilable
LLVM_Backend is not compiled
BCGen
immutable(const(BCValue) function(BCValue[], BCHeap*) @safe)
C_BCGen
immutable(BCValue function(BCValue[], BCHeap*) pure @safe)
posix.mak:527: recipe for target '.generated/docs-prerelease.json' failed
make: *** [.generated/docs-prerelease.json] Error

@CyberShadow it looks like DAutoTest was trying to build the newCTFE dmd branch, instead of master or stable. Any ideas? Probably restarting the test would fix the error. Is there a way I can do that without force-pushing in Joakim's branch? (I'm on the phone currently and I don't have a proper build environment setup).

Edit: Nevermind closing & reopening tge PR did the trick, though there was a slight delay, compared to the other CIs.

@joakim-noah
Copy link
Contributor Author

Thanks for the spacing tip, Petar, added it to my commit and removed yours.

@joakim-noah joakim-noah changed the title Add ARM alignment fix, version out some HardFloat tests, and use declaration from druntime ARM and Android fixes: Add ARM alignment fix, version out some HardFloat tests, use TMPDIR logic and declaration from druntime Sep 16, 2017
{
// Don't check for a global temporary directory as
// Android doesn't have one.
}
Copy link
Contributor Author

@joakim-noah joakim-noah Sep 16, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added another Android commit, as ldc is now packaged on Android itself, and the Termux app does set TMPDIR. This was versioned out because it was assumed that most would run D code from their own Android GUI app, ie an apk, which don't have any global temp directory. However, now that D can be used from the command-line on Android itself, with an app that sets TMPDIR for all command-line D tools, just use the below Posix logic. You will have to set the tempDir by hand for each apk anyway, if you use it.

This already hit me when building rdmd for Android, where I had to patch it to check TMPDIR, can take that out with this change.

@joakim-noah
Copy link
Contributor Author

Removed the largest murmurhash commit, as it was originally written by @kinke and he has since reworked it and submitted a new version in another pull. Also added one more commit with a doc fix, this pull is very small and easy to review now, @wilzbach.

@joakim-noah joakim-noah changed the title ARM and Android fixes: Add ARM alignment fix, version out some HardFloat tests, use TMPDIR logic and declaration from druntime ARM and Android fixes: version out some HardFloat tests, use TMPDIR logic and declaration from druntime, and correct comment Dec 22, 2017
@dnadlinger dnadlinger dismissed ibuclaw’s stale review December 22, 2017 14:00

The murmurhash changes are no longer part of this PR.

@dlang-bot dlang-bot merged commit 15adac4 into dlang:master Dec 22, 2017
chunorig pushed a commit to chunorig/gcc-kludges that referenced this pull request Jan 26, 2019
Commits merged from druntime.

    Fix struct tls_index definition on x32
    dlang/druntime#2354

    Update SectionGroup signatures to match on all targets
    dlang/druntime#2401

    Fix issue 19128 - argument to alloca may be too large
    dlang/druntime#2409

    Define some common filesystem limits in core.stdc.limits
    dlang/druntime#2460

    Use version Darwin instead of OSX in core.sys.posix.aio
    dlang/druntime#2470

Commits merged from phobos.

    Don't run HardFloat tests on SoftFloat systems
    dlang/phobos#5358

    Remove reliance on stdin, stdout, stderr being aliasable
    dlang/phobos#5718

    Solaris: add import clock_gettime to currStdTime
    dlang/phobos#5807

    Don't print debug messages when building unittests
    dlang/phobos#6827

    Add HPPA support to phobos
    Fixes https://gcc.gnu.org/PR89054
    dlang/phobos#6836

git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@268293 138bc75d-0d04-0410-961f-82ee72b054a4
ibuclaw added a commit to ibuclaw/phobos that referenced this pull request Feb 10, 2019
PetarKirov added a commit that referenced this pull request Feb 11, 2019
[dmd-cxx] Backport non-math related patches from #5358
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants