Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[pull] main from llvm:main #170

Merged
merged 14 commits into from
Oct 17, 2021
Merged

[pull] main from llvm:main #170

merged 14 commits into from
Oct 17, 2021

Conversation

pull[bot]
Copy link

@pull pull bot commented Oct 17, 2021

See Commits and Changes for more details.


Created by pull[bot]

Can you help keep this open source service alive? 💖 Please sponsor : )

kito-cheng and others added 14 commits October 17, 2021 16:38
Instead of creating real threads for trace tests
create a new ThreadState in the main thread.
This makes the tests more unit-testy and will also
help with future trace tests that will need
more than 1 thread. Creating more than 1 real thread and
dispatching test actions across multiple threads in the
required deterministic order is painful.

This is resubmit of reverted D110546 with 2 changes:
1. The previous version patched ImitateTlsWrite to not
expect ThreadState to be allocated in TLS (the CHECK
failed for the fake test threads).
This added an ugly hack into production code and was still
logically wrong because we imitated write to the main
thread TLS/stack when we started the fake test thread
(which has nothing to do with the main thread TLS/stack).
This version uses ThreadType::Fiber instead of ThreadType::Regular
for the fake threads. This naturally makes ThreadStart skip
obtaining stack/tls and imitating writes to them.

2. This version still skips the tests on Darwin and PowerPC
to be on the safer side. Build bots reported failures for PowerPC
for the previous version.

Reviewed By: melver

Differential Revision: https://reviews.llvm.org/D111156
SVE has predicated literal forms of some instructions for specific
literals, which currently are generated correctly when using ACLE
but not when those instructions are generated directly.

This adds the patterns to generate those instructions when
generating from standard LLVM IR instructions.

Differential Revision: https://reviews.llvm.org/D99074
The C and C++ standards require the argument to __has_cpp_attribute and
__has_c_attribute to be expanded ([cpp.cond]p5). It would make little sense
to expand the argument to those operators but not expand the argument to
__has_attribute and __has_declspec, so those were both also changed in this
patch.

Note that it might make sense for the other builtins to also expand their
argument, but it wasn't as clear to me whether the behavior would be correct
there, and so they were left for a future revision.
Previously, we reported the same value as for C17, now we report 202000L, which
is the same value currently used by GCC.

Once C23 ships, this value will be bumped to the correct date.
```
[5.1] 2.21.2 THREADPRIVATE Directive
A variable that appears in a threadprivate directive must be declared in
the scope of a module or have the SAVE attribute, either explicitly or
implicitly.
A variable that appears in a threadprivate directive must not be an
element of a common block or appear in an EQUIVALENCE statement.
```

This patch supports the following checks for DECLARE TARGET Directive:
```
[5.1] 2.14.7 Declare Target Directive
A variable that is part of another variable (as an array, structure
element or type parameter inquiry) cannot appear in a declare
target directive.
A variable that appears in a declare target directive must be declared
in the scope of a module or have the SAVE attribute, either explicitly
or implicitly.
A variable that appears in a declare target directive must not be an
element of a common block or appear in an EQUIVALENCE statement.
```

As Fortran 2018 standard [8.5.16] states, a variable, common block, or
procedure pointer declared in the scoping unit of a main program,
module, or submodule implicitly has the SAVE attribute, which may be
confirmed by explicit specification.

Reviewed By: kiranchandramohan

Differential Revision: https://reviews.llvm.org/D109864
A few more tuples are being queried after D111546. Might be good to model them,
They all require a lot of manual assembly surgery.

The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/YTeT9M7fW - for intels `Block RThroughput: <=212.0`; for ryzens, `Block RThroughput: <=64.0`
So could pick cost of `212`

For store we have:
https://godbolt.org/z/vc954KEGP - for intels `Block RThroughput: <=90.0`; for ryzens, `Block RThroughput: <=24.0`
So we could pick cost of `90`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111940
A few more tuples are being queried after D111546. Might be good to model them,
They all require a lot of manual assembly surgery.

The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/s5b6E6jsP - for intels `Block RThroughput: <=32.0`; for ryzens, `Block RThroughput: <=24.0`
So could pick cost of `32`

For store we have:
https://godbolt.org/z/efh99d93b - for intels `Block RThroughput: <=48.0`; for ryzens, `Block RThroughput: <=32.0`
So we could pick cost of `48`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111942
A few more tuples are being queried after D111546. Might be good to model them,
They all require a lot of manual assembly surgery.

The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/11rcvdreP - for intels `Block RThroughput: <=68.0`; for ryzens, `Block RThroughput: <=48.0`
So could pick cost of `68`

For store we have:
https://godbolt.org/z/6aM11fWcP - for intels `Block RThroughput: <=64.0`; for ryzens, `Block RThroughput: <=32.0`
So we could pick cost of `64`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111943
A few more tuples are being queried after D111546. Might be good to model them,
They all require a lot of manual assembly surgery.

The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/MTaKboejM - for intels `Block RThroughput: =32.0`; for ryzens, `Block RThroughput: <=16.0`
So could pick cost of `32`

For store we have:
https://godbolt.org/z/v7xPj3Wd4 - for intels `Block RThroughput: =32.0`; for ryzens, `Block RThroughput: <=32.0`
So we could pick cost of `32`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111944
A few more tuples are being queried after D111546. Might be good to model them,
They all require a lot of manual assembly surgery.

The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/9bnKrefcG - for intels `Block RThroughput: =40.0`; for ryzens, `Block RThroughput: =16.0`
So could pick cost of `40`

For store we have:
https://godbolt.org/z/5s3s14dEY - for intels `Block RThroughput: =40.0`; for ryzens, `Block RThroughput: =16.0`
So we could pick cost of `40`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111945
The multiply() implementation is very slow -- it performs six
multiplications in double the bitwidth, which means that it will
typically work on allocated APInts and bypass fast-path
implementations. Add an additional implementation that doesn't
try to produce anything better than a full range if overflow is
possible. At least for the BasicAA use-case, we really don't care
about more precise modeling of overflow behavior. The current
use of multiply() is fine while the implementation is limited to
a single index, but extending it to the multiple-index case makes
the compile-time impact untenable.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants