Basic information #8 (Open)

5hello wants to merge 2 commits into master

Conversation

@5hello commented Oct 12, 2016

No description provided.

kraj pushed a commit to kraj/gcc that referenced this pull request Aug 22, 2019
Like the logical operations, expand all shifts early rather than only
sometimes.  The Neon shift expansions are never emitted (not even with
-fneon-for-64bits), so they are not useful.  So all the late expansions
and Neon shift patterns can be removed, and shifts are more optimized
as a result.  Since some extend patterns use Neon DImode shifts, remove
the Neon extend variants and related splits.

A simple example now generates the same efficient code after this
patch with -mfpu=neon and -mfpu=vfp (previously just the fact of
having Neon enabled resulted in inefficient code for no reason).

unsigned long long f(unsigned long long x, unsigned long long y)
{ return x & (y >> 33); }

Before:
	strd    r4, r5, [sp, #-8]!
	lsr     r4, r3, #1
	mov     r5, #0
	and     r1, r1, r5
	and     r0, r0, r4
	ldrd    r4, r5, [sp]
	add     sp, sp, #8
	bx      lr

After:
	and     r0, r0, r3, lsr #1
	mov     r1, #0
	bx      lr

Bootstrap and regress OK on arm-none-linux-gnueabihf --with-cpu=cortex-a57

    gcc/
	* config/arm/iterators.md (qhs_extenddi_cstr): Update.
	(qhs_extenddi_cstr): Likewise.
	* config/arm/arm.md (ashldi3): Always expand early.
	(ashlsi3): Likewise.
	(ashrsi3): Likewise.
	(zero_extend<mode>di2): Remove Neon variants.
	(extend<mode>di2): Likewise.
	* config/arm/neon.md (ashldi3_neon_noclobber): Remove.
	(signed_shift_di3_neon): Likewise.
	(unsigned_shift_di3_neon): Likewise.
	(ashrdi3_neon_imm_noclobber): Likewise.
	(lshrdi3_neon_imm_noclobber): Likewise.
	(<shift>di3_neon): Likewise.
	(split extend): Remove DI extend split patterns.

   gcc/testsuite/
	* gcc.target/arm/neon-extend-1.c: Remove test.
	* gcc.target/arm/neon-extend-2.c: Remove test.


git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@274824 138bc75d-0d04-0410-961f-82ee72b054a4
kraj pushed a commit to kraj/gcc that referenced this pull request Sep 10, 2019
…ested function

In FDPIC mode, the trampoline generated to support pointers to nested
functions looks like:

	   .word	trampoline address
	   .word	trampoline GOT address
	   ldr 		r12, [pc, #8]
	   ldr 		r9, [pc, #8]
	   ldr		pc, [pc, #8]
	   .word	static chain value
	   .word	GOT address
	   .word	function's address

Because in FDPIC mode function pointers are actually pointers to function
descriptors, we have to generate a function descriptor for the
trampoline.

2019-09-10  Christophe Lyon  <christophe.lyon@st.com>
	Mickaël Guêné <mickael.guene@st.com>

	gcc/
	* config/arm/arm.c (arm_asm_trampoline_template): Add FDPIC
	support.
	(arm_trampoline_init): Likewise.
	(arm_trampoline_adjust_address): Likewise.
	* config/arm/arm.h (TRAMPOLINE_SIZE): Likewise.



git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@275571 138bc75d-0d04-0410-961f-82ee72b054a4
kraj pushed a commit to kraj/gcc that referenced this pull request Dec 17, 2019
This patch extends support for -mpure-code to all thumb-1 processors,
by removing the need for MOVT.

Symbol addresses are built using upper8_15, upper0_7, lower8_15 and
lower0_7 relocations, and constants are built using sequences of
movs/adds and lsls instructions.

The extension of the *thumb1_movhf pattern always uses the same size
(6), although it can emit a shorter sequence when possible. This is
similar to what *arm32_movhf already does.

CASE_VECTOR_PC_RELATIVE is now false with -mpure-code, to avoid
generating invalid assembly code with differences between symbols from
two different sections (the difference cannot be computed by the
assembler).

Tests pr45701-[12].c needed a small adjustment to avoid matching
upper8_15 when looking for the r8 register.

Test no-literal-pool.c is augmented with __fp16, so it now uses
-mfp16-format=ieee.

Test thumb1-Os-mult.c generates an inline code sequence with
-mpure-code and computes the multiplication using a sequence of
add/shift rather than the multiply instruction, so we skip it in
the presence of -mpure-code.

With -mcpu=cortex-m0, the pure-code/no-literal-pool.c fails because
code like:
static char *p = "Hello World";
char *
testchar ()
{
  return p + 4;
}

generates 2 indirections (I removed non-essential directives/code)
          .section        .rodata
	  .LC0:
	  .ascii  "Hello World\000"
	  .data
	  p:
	  .word   .LC0
	  .section        .rodata
	  .LC2:
	  .word   p
	  .section .text,"0x20000006",%progbits
	  testchar:
	  push    {r7, lr}
	  add     r7, sp, #0
	  movs    r3, #:upper8_15:#.LC2
	  lsls    r3, #8
	  adds    r3, #:upper0_7:#.LC2
	  lsls    r3, #8
	  adds    r3, #:lower8_15:#.LC2
	  lsls    r3, #8
	  adds    r3, #:lower0_7:#.LC2
	  ldr     r3, [r3]
	  ldr     r3, [r3]
	  adds    r3, r3, #4
	  movs    r0, r3
	  mov     sp, r7
	  @ sp needed
	  pop     {r7, pc}

By contrast, when using -mcpu=cortex-m4, the code looks like:
        .section        .rodata
	.LC0:
	.ascii  "Hello World\000"
	.data
	p:
	.word   .LC0
	testchar:
	push    {r7}
	add     r7, sp, #0
	movw    r3, #:lower16:p
	movt    r3, #:upper16:p
	ldr     r3, [r3]
	adds    r3, r3, #4
	mov     r0, r3
	mov     sp, r7
	pop     {r7}
	bx      lr

I haven't yet found how to make the code for cortex-m0 apply upper/lower
relocations to "p" instead of .LC2. The current code looks functional,
but could be improved.

2019-10-18  Christophe Lyon  <christophe.lyon@linaro.org>

	gcc/
	* config/arm/arm-protos.h (thumb1_gen_const_int): Add new prototype.
	* config/arm/arm.c (arm_option_check_internal): Remove restriction
	on MOVT for -mpure-code.
	(thumb1_gen_const_int): New function.
	(thumb1_legitimate_address_p): Support -mpure-code.
	(thumb1_rtx_costs): Likewise.
	(thumb1_size_rtx_costs): Likewise.
	(arm_thumb1_mi_thunk): Likewise.
	* config/arm/arm.h (CASE_VECTOR_PC_RELATIVE): Likewise.
	* config/arm/thumb1.md (thumb1_movsi_symbol_ref): New.
	(*thumb1_movhf): Support -mpure-code.

	gcc/testsuite/
	* gcc.target/arm/pr45701-1.c: Adjust for -mpure-code.
	* gcc.target/arm/pr45701-2.c: Likewise.
	* gcc.target/arm/pure-code/no-literal-pool.c: Add tests for
	__fp16.
	* gcc.target/arm/pure-code/pure-code.exp: Remove thumb2 and movt
	conditions.
	* gcc.target/arm/thumb1-Os-mult.c: Skip if -mpure-code is used.




git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@279463 138bc75d-0d04-0410-961f-82ee72b054a4
jwakely pushed a commit to jwakely/gcc that referenced this pull request Jan 14, 2020
This patch extends support for -mpure-code to all thumb-1 processors,
by removing the need for MOVT.

Symbol addresses are built using upper8_15, upper0_7, lower8_15 and
lower0_7 relocations, and constants are built using sequences of
movs/adds and lsls instructions.

The extension of the *thumb1_movhf pattern always uses the same size
(6), although it can emit a shorter sequence when possible. This is
similar to what *arm32_movhf already does.

CASE_VECTOR_PC_RELATIVE is now false with -mpure-code, to avoid
generating invalid assembly code with differences between symbols from
two different sections (the difference cannot be computed by the
assembler).

Tests pr45701-[12].c needed a small adjustment to avoid matching
upper8_15 when looking for the r8 register.

Test no-literal-pool.c is augmented with __fp16, so it now uses
-mfp16-format=ieee.

Test thumb1-Os-mult.c generates an inline code sequence with
-mpure-code and computes the multiplication using a sequence of
add/shift rather than the multiply instruction, so we skip it in
the presence of -mpure-code.

With -mcpu=cortex-m0, the pure-code/no-literal-pool.c fails because
code like:
static char *p = "Hello World";
char *
testchar ()
{
  return p + 4;
}

generates 2 indirections (I removed non-essential directives/code)
          .section        .rodata
	  .LC0:
	  .ascii  "Hello World\000"
	  .data
	  p:
	  .word   .LC0
	  .section        .rodata
	  .LC2:
	  .word   p
	  .section .text,"0x20000006",%progbits
	  testchar:
	  push    {r7, lr}
	  add     r7, sp, #0
	  movs    r3, #:upper8_15:#.LC2
	  lsls    r3, #8
	  adds    r3, #:upper0_7:#.LC2
	  lsls    r3, #8
	  adds    r3, #:lower8_15:#.LC2
	  lsls    r3, #8
	  adds    r3, #:lower0_7:#.LC2
	  ldr     r3, [r3]
	  ldr     r3, [r3]
	  adds    r3, r3, #4
	  movs    r0, r3
	  mov     sp, r7
	  @ sp needed
	  pop     {r7, pc}

By contrast, when using -mcpu=cortex-m4, the code looks like:
        .section        .rodata
	.LC0:
	.ascii  "Hello World\000"
	.data
	p:
	.word   .LC0
	testchar:
	push    {r7}
	add     r7, sp, #0
	movw    r3, #:lower16:p
	movt    r3, #:upper16:p
	ldr     r3, [r3]
	adds    r3, r3, #4
	mov     r0, r3
	mov     sp, r7
	pop     {r7}
	bx      lr

I haven't yet found how to make the code for cortex-m0 apply upper/lower
relocations to "p" instead of .LC2. The current code looks functional,
but could be improved.

2019-10-18  Christophe Lyon  <christophe.lyon@linaro.org>

	gcc/
	* config/arm/arm-protos.h (thumb1_gen_const_int): Add new prototype.
	* config/arm/arm.c (arm_option_check_internal): Remove restriction
	on MOVT for -mpure-code.
	(thumb1_gen_const_int): New function.
	(thumb1_legitimate_address_p): Support -mpure-code.
	(thumb1_rtx_costs): Likewise.
	(thumb1_size_rtx_costs): Likewise.
	(arm_thumb1_mi_thunk): Likewise.
	* config/arm/arm.h (CASE_VECTOR_PC_RELATIVE): Likewise.
	* config/arm/thumb1.md (thumb1_movsi_symbol_ref): New.
	(*thumb1_movhf): Support -mpure-code.

	gcc/testsuite/
	* gcc.target/arm/pr45701-1.c: Adjust for -mpure-code.
	* gcc.target/arm/pr45701-2.c: Likewise.
	* gcc.target/arm/pure-code/no-literal-pool.c: Add tests for
	__fp16.
	* gcc.target/arm/pure-code/pure-code.exp: Remove thumb2 and movt
	conditions.
	* gcc.target/arm/thumb1-Os-mult.c: Skip if -mpure-code is used.

From-SVN: r279463
kraj pushed a commit to kraj/gcc that referenced this pull request Feb 25, 2020
This patch extends support for -mpure-code to all thumb-1 processors,
by removing the need for MOVT.

Symbol addresses are built using upper8_15, upper0_7, lower8_15 and
lower0_7 relocations, and constants are built using sequences of
movs/adds and lsls instructions.

The extension of the *thumb1_movhf pattern always uses the same size
(6), although it can emit a shorter sequence when possible. This is
similar to what *arm32_movhf already does.

CASE_VECTOR_PC_RELATIVE is now false with -mpure-code, to avoid
generating invalid assembly code with differences between symbols from
two different sections (the difference cannot be computed by the
assembler).

Tests pr45701-[12].c needed a small adjustment to avoid matching
upper8_15 when looking for the r8 register.

Test no-literal-pool.c is augmented with __fp16, so it now uses
-mfp16-format=ieee.

Test thumb1-Os-mult.c generates an inline code sequence with
-mpure-code and computes the multiplication using a sequence of
add/shift rather than the multiply instruction, so we skip it in
the presence of -mpure-code.

With -mcpu=cortex-m0, the pure-code/no-literal-pool.c fails because
code like:
static char *p = "Hello World";
char *
testchar ()
{
  return p + 4;
}

generates 2 indirections (I removed non-essential directives/code)
          .section        .rodata
	  .LC0:
	  .ascii  "Hello World\000"
	  .data
	  p:
	  .word   .LC0
	  .section        .rodata
	  .LC2:
	  .word   p
	  .section .text,"0x20000006",%progbits
	  testchar:
	  push    {r7, lr}
	  add     r7, sp, #0
	  movs    r3, #:upper8_15:#.LC2
	  lsls    r3, #8
	  adds    r3, #:upper0_7:#.LC2
	  lsls    r3, #8
	  adds    r3, #:lower8_15:#.LC2
	  lsls    r3, #8
	  adds    r3, #:lower0_7:#.LC2
	  ldr     r3, [r3]
	  ldr     r3, [r3]
	  adds    r3, r3, #4
	  movs    r0, r3
	  mov     sp, r7
	  @ sp needed
	  pop     {r7, pc}

By contrast, when using -mcpu=cortex-m4, the code looks like:
        .section        .rodata
	.LC0:
	.ascii  "Hello World\000"
	.data
	p:
	.word   .LC0
	testchar:
	push    {r7}
	add     r7, sp, #0
	movw    r3, #:lower16:p
	movt    r3, #:upper16:p
	ldr     r3, [r3]
	adds    r3, r3, #4
	mov     r0, r3
	mov     sp, r7
	pop     {r7}
	bx      lr

I haven't yet found how to make the code for cortex-m0 apply upper/lower
relocations to "p" instead of .LC2. The current code looks functional,
but could be improved.

2020-02-25  Christophe Lyon  <christophe.lyon@linaro.org>

	Backport from mainline
	2019-10-18  Christophe Lyon  <christophe.lyon@linaro.org>

	gcc/
	* config/arm/arm-protos.h (thumb1_gen_const_int): Add new prototype.
	* config/arm/arm.c (arm_option_check_internal): Remove restriction
	on MOVT for -mpure-code.
	(thumb1_gen_const_int): New function.
	(thumb1_legitimate_address_p): Support -mpure-code.
	(thumb1_rtx_costs): Likewise.
	(thumb1_size_rtx_costs): Likewise.
	(arm_thumb1_mi_thunk): Likewise.
	* config/arm/arm.h (CASE_VECTOR_PC_RELATIVE): Likewise.
	* config/arm/thumb1.md (thumb1_movsi_symbol_ref): New.
	(*thumb1_movhf): Support -mpure-code.

	gcc/testsuite/
	* gcc.target/arm/pr45701-1.c: Adjust for -mpure-code.
	* gcc.target/arm/pr45701-2.c: Likewise.
	* gcc.target/arm/pure-code/no-literal-pool.c: Add tests for
	__fp16.
	* gcc.target/arm/pure-code/pure-code.exp: Remove thumb2 and movt
	conditions.
	* gcc.target/arm/thumb1-Os-mult.c: Skip if -mpure-code is used.
kraj pushed a commit to kraj/gcc that referenced this pull request Mar 11, 2020
When using `check-function-bodies`, the subroutine `parse_function_bodies` uses
the `fluff` regexp to remove uninteresting assembly lines.

Arm targets generate assembly with some lines prefixed by `@`; these lines
are left behind by this process.

As an example of some lines prefixed by `@`, the assembly output from the
`stacktest1` function in "bfloat16_simd_3_1.c" is:

        .align  2
        .global stacktest1
        .arch armv8.2-a
        .syntax unified
        .arm
        .fpu neon-fp-armv8
        .type   stacktest1, %function
stacktest1:
        @ args = 0, pretend = 0, frame = 8
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        sub     sp, sp, #8
        add     r3, sp, #6
        vst1.16 {d0[0]}, [r3]
        vld1.16 {d0[0]}, [r3]
        add     sp, sp, #8
        @ sp needed
        bx      lr
        .size   stacktest1, .-stacktest1

It seems that previous uses of `check-function-bodies` in the arm backend have
avoided problems with such lines since they use the `...` regexp in each place
such fluff occurs.

I'm currently writing a patch in which I'd like to match the entire function
body, so I'd like to remove such `@` lines automatically.

gcc/testsuite/ChangeLog:

2020-03-11  Matthew Malcomson  <matthew.malcomson@arm.com>

	* lib/scanasm.exp (parse_function_bodies): Lines starting with '@' also
	counted as fluff.
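
For illustration, here is a small self-contained C++ sketch of the filtering idea (an assumed example, not the actual Tcl code in lib/scanasm.exp): lines beginning with an assembler directive ('.') or an Arm '@' comment are treated as fluff and dropped before the function body is compared.

// Hypothetical illustration of the fluff-filtering idea described above.
#include <iostream>
#include <regex>
#include <string>
#include <vector>

int main ()
{
  std::vector<std::string> body = {
    "stacktest1:",
    "\t@ args = 0, pretend = 0, frame = 8",
    "\tsub     sp, sp, #8",
    "\t@ sp needed",
    "\tbx      lr",
    "\t.size   stacktest1, .-stacktest1",
  };
  // '.' directives and, with the change above, '@' comment lines are fluff.
  std::regex fluff{R"(^\s*(\.|@))"};
  for (const std::string &line : body)
    if (!std::regex_search (line, fluff))
      std::cout << line << '\n';   // only the label and real instructions remain
  return 0;
}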
kraj pushed a commit to kraj/gcc that referenced this pull request Aug 12, 2020
The stack-protector-1.c test fails when compiled for Cortex-M:
- for Cortex-M0/M1, str r0, [sp, #-8]! is not supported
- for Cortex-M3/M4..., the assembler complains that "use of r13 is
  deprecated"

This patch replaces the str instruction with
     sub   sp, sp, #8
     str r0, [sp]
and removes the check for r13, which is unlikely to leak the canary
value.

2020-08-11  Christophe Lyon  <christophe.lyon@linaro.org>

	gcc/testsuite/
	* gcc.target/arm/stack-protector-1.c: Adapt code to Cortex-M
	restrictions.
kraj pushed a commit to kraj/gcc that referenced this pull request Aug 12, 2020
The stack-protector-1.c test fails when compiled for Cortex-M:
- for Cortex-M0/M1, str r0, [sp, #-8]! is not supported
- for Cortex-M3/M4..., the assembler complains that "use of r13 is
  deprecated"

This patch replaces the str instruction with
     sub   sp, sp, #8
     str r0, [sp]
and removes the check for r13, which is unlikely to leak the canary
value.

2020-08-11  Christophe Lyon  <christophe.lyon@linaro.org>

	gcc/testsuite/
	* gcc.target/arm/stack-protector-1.c: Adapt code to Cortex-M
	restrictions.

(cherry picked from commit 6606fdc)
kraj pushed a commit to kraj/gcc that referenced this pull request Sep 4, 2020
This patch moves the move-immediate splitter after the regular ones so
that it has lower precedence, and updates its constraints.

For
int f3 (void) { return 0x11000000; }
int f3_2 (void) { return 0x12345678; }

we now generate:
* with -O2 -mcpu=cortex-m0 -mpure-code:
f3:
	movs    r0, #136
	lsls    r0, r0, #21
	bx      lr
f3_2:
	movs    r0, #18
	lsls    r0, r0, #8
	adds    r0, r0, #52
	lsls    r0, r0, #8
	adds    r0, r0, #86
	lsls    r0, r0, #8
	adds    r0, r0, #121
	bx      lr

* with -O2 -mcpu=cortex-m23 -mpure-code:
f3:
	movs    r0, #136
	lsls    r0, r0, #21
	bx      lr
f3_2:
	movw    r0, #22136
	movt    r0, 4660
	bx      lr

2020-09-04  Christophe Lyon  <christophe.lyon@linaro.org>

	PR target/96769
	gcc/
	* config/arm/thumb1.md: Move movsi splitter for
	arm_disable_literal_pool after the other movsi splitters.

	gcc/testsuite/
	* gcc.target/arm/pure-code/pr96769.c: New test.
kraj pushed a commit to kraj/gcc that referenced this pull request Oct 12, 2020
Prevents the following UBSAN error:

./xgcc -B. /home/marxin/Programming/gcc/gcc/testsuite/g++.dg/torture/pr49770.C -O2 -c
/home/marxin/Programming/gcc2/gcc/ipa-modref-tree.h:482:22: runtime error: load of value 2, which is not a valid value for type 'bool'
    #0 0x1fdb4d1 in modref_tree<int>::merge(modref_tree<int>*, vec<modref_parm_map, va_heap, vl_ptr>*) /home/marxin/Programming/gcc2/gcc/ipa-modref-tree.h:482
    #1 0x1fcadaa in merge_call_side_effects(modref_summary*, gimple*, modref_summary*, bool) /home/marxin/Programming/gcc2/gcc/ipa-modref.c:511
    #2 0x1fcbadd in analyze_call /home/marxin/Programming/gcc2/gcc/ipa-modref.c:642
    #3 0x1fcc061 in analyze_stmt /home/marxin/Programming/gcc2/gcc/ipa-modref.c:732
    #4 0x1fccf31 in analyze_function /home/marxin/Programming/gcc2/gcc/ipa-modref.c:823
    #5 0x1fd17e5 in execute /home/marxin/Programming/gcc2/gcc/ipa-modref.c:1441
    #6 0x25cca6e in execute_one_pass(opt_pass*) /home/marxin/Programming/gcc2/gcc/passes.c:2509
    #7 0x25cd39b in execute_pass_list_1 /home/marxin/Programming/gcc2/gcc/passes.c:2597
    #8 0x25cd450 in execute_pass_list_1 /home/marxin/Programming/gcc2/gcc/passes.c:2598
    #9 0x25cd4ee in execute_pass_list(function*, opt_pass*) /home/marxin/Programming/gcc2/gcc/passes.c:2608
    #10 0x25c7a5a in do_per_function_toporder(void (*)(function*, void*), void*) /home/marxin/Programming/gcc2/gcc/passes.c:1726
    #11 0x25cfa3f in execute_ipa_pass_list(opt_pass*) /home/marxin/Programming/gcc2/gcc/passes.c:2941
    #12 0x173572d in ipa_passes /home/marxin/Programming/gcc2/gcc/cgraphunit.c:2642
    #13 0x17364ee in symbol_table::compile() /home/marxin/Programming/gcc2/gcc/cgraphunit.c:2777
    #14 0x17372d9 in symbol_table::finalize_compilation_unit() /home/marxin/Programming/gcc2/gcc/cgraphunit.c:3022
    #15 0x2a1f00a in compile_file /home/marxin/Programming/gcc2/gcc/toplev.c:485
    #16 0x2a27dc8 in do_compile /home/marxin/Programming/gcc2/gcc/toplev.c:2321
    #17 0x2a283cc in toplev::main(int, char**) /home/marxin/Programming/gcc2/gcc/toplev.c:2460
    #18 0x54f21cd in main /home/marxin/Programming/gcc2/gcc/main.c:39
    #19 0x7ffff6f0de09 in __libc_start_main ../csu/libc-start.c:314
    #20 0x9eac09 in _start (/home/marxin/Programming/gcc2/objdir/gcc/cc1plus+0x9eac09)

gcc/ChangeLog:

	* ipa-modref.c (merge_call_side_effects): Clear modref_parm_map
	fields in the vector.
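
A minimal stand-alone C++ illustration of the failure mode (the struct and field names below are made up to mirror modref_parm_map; this is not the ipa-modref code itself): an element whose storage is never cleared can hold a byte such as 2 in a bool member, and loading it is the undefined behaviour UBSan reports. Clearing the element before it is read, as the fix does for the vector, avoids it.

// Hypothetical sketch of the uninitialized-bool defect and its fix.
#include <cstdio>

struct parm_map_sketch   // assumed stand-in for modref_parm_map
{
  int parm_index;
  bool parm_offset_known;
};

int main ()
{
  // Default-initialization leaves the members indeterminate, much like
  // growing a vector without clearing the new elements.
  parm_map_sketch *v = new parm_map_sketch[1];

  // Reading v[0].parm_offset_known here would be the reported UB: the byte
  // may happen to hold 2, which is not a valid value for 'bool'.

  v[0] = parm_map_sketch ();   // the fix: clear the element before it is read
  std::printf ("parm_offset_known = %d\n", v[0].parm_offset_known ? 1 : 0);

  delete[] v;
  return 0;
}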
kraj pushed a commit to kraj/gcc that referenced this pull request Oct 16, 2020
…trinsics with -O2 (PR97271).

This patch fixes (PR97271) the wrong code-gen for mve scatter store with writeback intrinsics with -O2.

$cat bug.c
void
foo (uint32x4_t * addr, const int offset, int32x4_t value)
{
  vstrwq_scatter_base_wb_s32 (addr, 8, value);
}

$ arm-none-eabi-gcc  bug.c -S -O2 -march=armv8.1-m.main+mve -mfloat-abi=hard -o -
Without this patch:
...
foo:
	vldrw.32	q3, [r0]
	vstrw.u32       q0, [q3, #8]!  ---> (A)
	vldr.64 d4, .L3
	vldr.64 d5, .L3+8
	vldrw.32	q3, [r0]
	vstrw.u32       q2, [q3, #8]!  ---> (B)
	bx      lr
...

With this patch:
...
foo:
	vldrw.32	q3, [r0]
	vstrw.u32       q0, [q3, #8]!  --> (C)
	vstrw.32	q3, [r0]
	bx      lr
...

Without this patch, two vstrw assembly instructions (A and B) are generated for the vstrwq_scatter_base_wb_s32
intrinsic, whereas the fix generates only one vstrw assembly instruction (C).

gcc/ChangeLog:

2020-10-06  Srinath Parvathaneni  <srinath.parvathaneni@arm.com>

	PR target/97291
	* config/arm/arm-builtins.c (arm_strsbwbs_qualifiers): Modify array.
	(arm_strsbwbu_qualifiers): Likewise.
	(arm_strsbwbs_p_qualifiers): Likewise.
	(arm_strsbwbu_p_qualifiers): Likewise.
	* config/arm/arm_mve.h (__arm_vstrdq_scatter_base_wb_s64): Modify
	function definition.
	(__arm_vstrdq_scatter_base_wb_u64): Likewise.
	(__arm_vstrdq_scatter_base_wb_p_s64): Likewise.
	(__arm_vstrdq_scatter_base_wb_p_u64): Likewise.
	(__arm_vstrwq_scatter_base_wb_p_s32): Likewise.
	(__arm_vstrwq_scatter_base_wb_p_u32): Likewise.
	(__arm_vstrwq_scatter_base_wb_s32): Likewise.
	(__arm_vstrwq_scatter_base_wb_u32): Likewise.
	(__arm_vstrwq_scatter_base_wb_f32): Likewise.
	(__arm_vstrwq_scatter_base_wb_p_f32): Likewise.
	* config/arm/arm_mve_builtins.def (vstrwq_scatter_base_wb_add_u): Remove
	expansion for the builtin.
	(vstrwq_scatter_base_wb_add_s): Likewise.
	(vstrwq_scatter_base_wb_add_f): Likewise.
	(vstrdq_scatter_base_wb_add_u): Likewise.
	(vstrdq_scatter_base_wb_add_s): Likewise.
	(vstrwq_scatter_base_wb_p_add_u): Likewise.
	(vstrwq_scatter_base_wb_p_add_s): Likewise.
	(vstrwq_scatter_base_wb_p_add_f): Likewise.
	(vstrdq_scatter_base_wb_p_add_u): Likewise.
	(vstrdq_scatter_base_wb_p_add_s): Likewise.
	* config/arm/mve.md (mve_vstrwq_scatter_base_wb_<supf>v4si): Remove
	expand.
	(mve_vstrwq_scatter_base_wb_add_<supf>v4si): Likewise.
	(mve_vstrwq_scatter_base_wb_<supf>v4si_insn): Rename pattern to ...
	(mve_vstrwq_scatter_base_wb_<supf>v4si): This.
	(mve_vstrwq_scatter_base_wb_p_<supf>v4si): Remove expand.
	(mve_vstrwq_scatter_base_wb_p_add_<supf>v4si): Likewise.
	(mve_vstrwq_scatter_base_wb_p_<supf>v4si_insn): Rename pattern to ...
	(mve_vstrwq_scatter_base_wb_p_<supf>v4si): This.
	(mve_vstrwq_scatter_base_wb_fv4sf): Remove expand.
	(mve_vstrwq_scatter_base_wb_add_fv4sf): Likewise.
	(mve_vstrwq_scatter_base_wb_fv4sf_insn): Rename pattern to ...
	(mve_vstrwq_scatter_base_wb_fv4sf): This.
	(mve_vstrwq_scatter_base_wb_p_fv4sf): Remove expand.
	(mve_vstrwq_scatter_base_wb_p_add_fv4sf): Likewise.
	(mve_vstrwq_scatter_base_wb_p_fv4sf_insn): Rename pattern to ...
	(mve_vstrwq_scatter_base_wb_p_fv4sf): This.
	(mve_vstrdq_scatter_base_wb_<supf>v2di): Remove expand.
	(mve_vstrdq_scatter_base_wb_add_<supf>v2di): Likewise.
	(mve_vstrdq_scatter_base_wb_<supf>v2di_insn): Rename pattern to ...
	(mve_vstrdq_scatter_base_wb_<supf>v2di): This.
	(mve_vstrdq_scatter_base_wb_p_<supf>v2di): Remove expand.
	(mve_vstrdq_scatter_base_wb_p_add_<supf>v2di): Likewise.
	(mve_vstrdq_scatter_base_wb_p_<supf>v2di_insn): Rename pattern to ...
	(mve_vstrdq_scatter_base_wb_p_<supf>v2di): This.

gcc/testsuite/ChangeLog:

	PR target/97291
	* gcc.target/arm/mve/intrinsics/vstrdq_scatter_base_wb_p_s64.c: Modify.
	* gcc.target/arm/mve/intrinsics/vstrdq_scatter_base_wb_p_u64.c:
	Likewise.
	* gcc.target/arm/mve/intrinsics/vstrdq_scatter_base_wb_s64.c: Likewise.
	* gcc.target/arm/mve/intrinsics/vstrdq_scatter_base_wb_u64.c: Likewise.
	* gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_wb_f32.c: Likewise.
	* gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_wb_p_f32.c:
	Likewise.
	* gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_wb_p_s32.c:
	Likewise.
	* gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_wb_p_u32.c:
	Likewise.
	* gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_wb_s32.c: Likewise.
	* gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_wb_u32.c: Likewise.
kraj pushed a commit to kraj/gcc that referenced this pull request Oct 16, 2020
…trinsics with -O2 (PR97271).

This patch fixes (PR97271) the wrong code-gen for mve scatter store with writeback intrinsics with -O2.

$cat bug.c
void
foo (uint32x4_t * addr, const int offset, int32x4_t value)
{
  vstrwq_scatter_base_wb_s32 (addr, 8, value);
}

$ arm-none-eabi-gcc  bug.c -S -O2 -march=armv8.1-m.main+mve -mfloat-abi=hard -o -
Without this patch:
...
foo:
	vldrw.32	q3, [r0]
	vstrw.u32       q0, [q3, #8]!  ---> (A)
	vldr.64 d4, .L3
	vldr.64 d5, .L3+8
	vldrw.32	q3, [r0]
	vstrw.u32       q2, [q3, #8]!  ---> (B)
	bx      lr
...

With this patch:
...
foo:
	vldrw.32	q3, [r0]
	vstrw.u32       q0, [q3, #8]!  --> (C)
	vstrw.32	q3, [r0]
	bx      lr
...

Without this patch, two vstrw assembly instructions (A and B) are generated for the vstrwq_scatter_base_wb_s32
intrinsic, whereas the fix generates only one vstrw assembly instruction (C).

gcc/ChangeLog:

2020-10-06  Srinath Parvathaneni  <srinath.parvathaneni@arm.com>

	PR target/97291
	* config/arm/arm-builtins.c (arm_strsbwbs_qualifiers): Modify array.
	(arm_strsbwbu_qualifiers): Likewise.
	(arm_strsbwbs_p_qualifiers): Likewise.
	(arm_strsbwbu_p_qualifiers): Likewise.
	* config/arm/arm_mve.h (__arm_vstrdq_scatter_base_wb_s64): Modify
	function definition.
	(__arm_vstrdq_scatter_base_wb_u64): Likewise.
	(__arm_vstrdq_scatter_base_wb_p_s64): Likewise.
	(__arm_vstrdq_scatter_base_wb_p_u64): Likewise.
	(__arm_vstrwq_scatter_base_wb_p_s32): Likewise.
	(__arm_vstrwq_scatter_base_wb_p_u32): Likewise.
	(__arm_vstrwq_scatter_base_wb_s32): Likewise.
	(__arm_vstrwq_scatter_base_wb_u32): Likewise.
	(__arm_vstrwq_scatter_base_wb_f32): Likewise.
	(__arm_vstrwq_scatter_base_wb_p_f32): Likewise.
	* config/arm/arm_mve_builtins.def (vstrwq_scatter_base_wb_add_u): Remove
	expansion for the builtin.
	(vstrwq_scatter_base_wb_add_s): Likewise.
	(vstrwq_scatter_base_wb_add_f): Likewise.
	(vstrdq_scatter_base_wb_add_u): Likewise.
	(vstrdq_scatter_base_wb_add_s): Likewise.
	(vstrwq_scatter_base_wb_p_add_u): Likewise.
	(vstrwq_scatter_base_wb_p_add_s): Likewise.
	(vstrwq_scatter_base_wb_p_add_f): Likewise.
	(vstrdq_scatter_base_wb_p_add_u): Likewise.
	(vstrdq_scatter_base_wb_p_add_s): Likewise.
	* config/arm/mve.md (mve_vstrwq_scatter_base_wb_<supf>v4si): Remove
	expand.
	(mve_vstrwq_scatter_base_wb_add_<supf>v4si): Likewise.
	(mve_vstrwq_scatter_base_wb_<supf>v4si_insn): Rename pattern to ...
	(mve_vstrwq_scatter_base_wb_<supf>v4si): This.
	(mve_vstrwq_scatter_base_wb_p_<supf>v4si): Remove expand.
	(mve_vstrwq_scatter_base_wb_p_add_<supf>v4si): Likewise.
	(mve_vstrwq_scatter_base_wb_p_<supf>v4si_insn): Rename pattern to ...
	(mve_vstrwq_scatter_base_wb_p_<supf>v4si): This.
	(mve_vstrwq_scatter_base_wb_fv4sf): Remove expand.
	(mve_vstrwq_scatter_base_wb_add_fv4sf): Likewise.
	(mve_vstrwq_scatter_base_wb_fv4sf_insn): Rename pattern to ...
	(mve_vstrwq_scatter_base_wb_fv4sf): This.
	(mve_vstrwq_scatter_base_wb_p_fv4sf): Remove expand.
	(mve_vstrwq_scatter_base_wb_p_add_fv4sf): Likewise.
	(mve_vstrwq_scatter_base_wb_p_fv4sf_insn): Rename pattern to ...
	(mve_vstrwq_scatter_base_wb_p_fv4sf): This.
	(mve_vstrdq_scatter_base_wb_<supf>v2di): Remove expand.
	(mve_vstrdq_scatter_base_wb_add_<supf>v2di): Likewise.
	(mve_vstrdq_scatter_base_wb_<supf>v2di_insn): Rename pattern to ...
	(mve_vstrdq_scatter_base_wb_<supf>v2di): This.
	(mve_vstrdq_scatter_base_wb_p_<supf>v2di): Remove expand.
	(mve_vstrdq_scatter_base_wb_p_add_<supf>v2di): Likewise.
	(mve_vstrdq_scatter_base_wb_p_<supf>v2di_insn): Rename pattern to ...
	(mve_vstrdq_scatter_base_wb_p_<supf>v2di): This.

gcc/testsuite/ChangeLog:

	PR target/97291
	* gcc.target/arm/mve/intrinsics/vstrdq_scatter_base_wb_p_s64.c: Modify.
	* gcc.target/arm/mve/intrinsics/vstrdq_scatter_base_wb_p_u64.c:
	Likewise.
	* gcc.target/arm/mve/intrinsics/vstrdq_scatter_base_wb_s64.c: Likewise.
	* gcc.target/arm/mve/intrinsics/vstrdq_scatter_base_wb_u64.c: Likewise.
	* gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_wb_f32.c: Likewise.
	* gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_wb_p_f32.c:
	Likewise.
	* gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_wb_p_s32.c:
	Likewise.
	* gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_wb_p_u32.c:
	Likewise.
	* gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_wb_s32.c: Likewise.
	* gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_wb_u32.c: Likewise.

(cherry picked from commit 3775358)
kraj pushed a commit to kraj/gcc that referenced this pull request Oct 19, 2020
It fixes:

/home/marxin/Programming/gcc2/gcc/ipa-modref-tree.h:482:22: runtime error: load of value 255, which is not a valid value for type 'bool'
    #0 0x18e5df3 in modref_tree<int>::merge(modref_tree<int>*, vec<modref_parm_map, va_heap, vl_ptr>*) /home/marxin/Programming/gcc2/gcc/ipa-modref-tree.h:482
    #1 0x18dc180 in ipa_merge_modref_summary_after_inlining(cgraph_edge*) /home/marxin/Programming/gcc2/gcc/ipa-modref.c:1779
    #2 0x18c1c72 in inline_call(cgraph_edge*, bool, vec<cgraph_edge*, va_heap, vl_ptr>*, int*, bool, bool*) /home/marxin/Programming/gcc2/gcc/ipa-inline-transform.c:492
    #3 0x4a3589c in inline_small_functions /home/marxin/Programming/gcc2/gcc/ipa-inline.c:2216
    #4 0x4a3b230 in ipa_inline /home/marxin/Programming/gcc2/gcc/ipa-inline.c:2697
    #5 0x4a3d902 in execute /home/marxin/Programming/gcc2/gcc/ipa-inline.c:3096
    #6 0x1edf831 in execute_one_pass(opt_pass*) /home/marxin/Programming/gcc2/gcc/passes.c:2509
    #7 0x1ee26af in execute_ipa_pass_list(opt_pass*) /home/marxin/Programming/gcc2/gcc/passes.c:2936
    #8 0x103f31b in ipa_passes /home/marxin/Programming/gcc2/gcc/cgraphunit.c:2700
    #9 0x103fb40 in symbol_table::compile() /home/marxin/Programming/gcc2/gcc/cgraphunit.c:2777
    #10 0x104092b in symbol_table::finalize_compilation_unit() /home/marxin/Programming/gcc2/gcc/cgraphunit.c:3022
    #11 0x235723b in compile_file /home/marxin/Programming/gcc2/gcc/toplev.c:485
    #12 0x235fff9 in do_compile /home/marxin/Programming/gcc2/gcc/toplev.c:2321
    #13 0x23605fc in toplev::main(int, char**) /home/marxin/Programming/gcc2/gcc/toplev.c:2460
    #14 0x4e2b93b in main /home/marxin/Programming/gcc2/gcc/main.c:39
    #15 0x7ffff6f0ae09 in __libc_start_main ../csu/libc-start.c:314
    #16 0x9a0be9 in _start (/home/marxin/Programming/gcc2/objdir/gcc/cc1+0x9a0be9)

gcc/ChangeLog:

	* ipa-modref.c (compute_parm_map): Clear vector.
kraj pushed a commit to kraj/gcc that referenced this pull request Nov 2, 2020
Enable thumb1_gen_const_int to generate RTL or asm depending on the
context, so that we avoid duplicating code to handle constants in
Thumb-1 with -mpure-code.

Use a template so that the algorithm is effectively shared, and
rely on two classes to handle the actual emission as RTL or asm.

The generated sequence is improved to handle right-shiftable and small
values with fewer instructions. We now generate:

128:
        movs    r0, r0, #128
264:
        movs    r3, #33
        lsls    r3, #3
510:
        movs    r3, #255
        lsls    r3, #1
512:
        movs    r3, #1
        lsls    r3, #9
764:
        movs    r3, #191
        lsls    r3, #2
65536:
        movs    r3, #1
        lsls    r3, #16
0x123456:
        movs    r3, #18 ;0x12
        lsls    r3, #8
        adds    r3, #52 ;0x34
        lsls    r3, #8
        adds    r3, #86 ;0x56
0x1123456:
        movs    r3, #137 ;0x89
        lsls    r3, #8
        adds    r3, #26 ;0x1a
        lsls    r3, #8
        adds    r3, #43 ;0x2b
        lsls    r3, #1
0x1000010:
        movs    r3, #16
        lsls    r3, #16
        adds    r3, #1
        lsls    r3, #4
0x1000011:
        movs    r3, #1
        lsls    r3, #24
        adds    r3, #17
-8192:
	movs	r3, #1
	lsls	r3, #13
	rsbs	r3, #0

The patch adds a testcase which does not fully exercise
thumb1_gen_const_int, as other existing patterns already catch small
constants.  These parts of thumb1_gen_const_int are used by
arm_thumb1_mi_thunk.

2020-11-02  Christophe Lyon  <christophe.lyon@linaro.org>

	gcc/
	* config/arm/arm.c (thumb1_const_rtl, thumb1_const_print): New
	classes.
	(thumb1_gen_const_int): Rename to ...
	(thumb1_gen_const_int_1): ... New helper function. Add capability
	to emit either RTL or asm, improve generated code.
	(thumb1_gen_const_int_rtl): New function.
	* config/arm/arm-protos.h (thumb1_gen_const_int): Rename to
	thumb1_gen_const_int_rtl.
	* config/arm/thumb1.md: Call thumb1_gen_const_int_rtl instead
	of thumb1_gen_const_int.

	gcc/testsuite/
	* gcc.target/arm/pure-code/no-literal-pool-m0.c: New.
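
The shared-algorithm/two-emitters idea can be sketched in plain C++ as below. This is only an illustration under assumed names (asm_emitter, rtl_emitter, gen_const_int) and a simplified byte-at-a-time splitting strategy; it is not the arm.c implementation, whose classes and constant-splitting logic differ.

// Hypothetical sketch: one constant-building algorithm, two emission back ends.
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Emits assembler text, as the -mpure-code asm output path would.
struct asm_emitter {
  void mov (unsigned imm) { std::printf ("\tmovs\tr3, #%u\n", imm); }
  void lsl (unsigned amt) { std::printf ("\tlsls\tr3, #%u\n", amt); }
  void add (unsigned imm) { std::printf ("\tadds\tr3, #%u\n", imm); }
};

// Stands in for the RTL-building path: records the operations instead.
struct rtl_emitter {
  std::vector<std::string> insns;
  void mov (unsigned imm) { insns.push_back ("set r3 " + std::to_string (imm)); }
  void lsl (unsigned amt) { insns.push_back ("ashift r3 " + std::to_string (amt)); }
  void add (unsigned imm) { insns.push_back ("plus r3 " + std::to_string (imm)); }
};

// Shared algorithm: build VAL byte by byte, high byte first
// (movs/lsls/adds), as in the sequences shown in the commit message.
template <typename Emitter>
void gen_const_int (Emitter &e, uint32_t val)
{
  int top = 24;
  while (top > 0 && ((val >> top) & 0xff) == 0)
    top -= 8;
  e.mov ((val >> top) & 0xff);
  for (int b = top - 8; b >= 0; b -= 8)
    {
      e.lsl (8);
      e.add ((val >> b) & 0xff);
    }
}

int main ()
{
  asm_emitter ae;
  gen_const_int (ae, 0x123456);   // prints the movs/lsls/adds sequence

  rtl_emitter re;
  gen_const_int (re, 0x123456);   // records the same steps as pseudo-RTL
  std::printf ("%zu recorded insns\n", re.insns.size ());
  return 0;
}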
kraj pushed a commit to kraj/gcc that referenced this pull request Jan 12, 2021
This patch adds new movmisalign<mode>_mve_load and store patterns for
MVE to help vectorization. They are very similar to their Neon
counterparts, but use different iterators and instructions.

Indeed MVE supports less vectors modes than Neon, so we use the
MVE_VLD_ST iterator where Neon uses VQX.

Since the supported modes are different from the ones valid for
arithmetic operators, we introduce two new sets of macros:

ARM_HAVE_NEON_<MODE>_LDST
  true if Neon has vector load/store instructions for <MODE>

ARM_HAVE_<MODE>_LDST
  true if any vector extension has vector load/store instructions for <MODE>

We move the movmisalign<mode> expander from neon.md to vec-common.md, and
replace the TARGET_NEON enabler with ARM_HAVE_<MODE>_LDST.

The patch also updates the mve-vneg.c test to scan for the better code
generation when loading and storing the vectors involved: it checks
that no 'orr' instruction is generated to cope with misalignment at
runtime.
This test was chosen among the other mve tests, but any other should
be OK. Using a plain vector copy loop (dest[i] = a[i]) is not a good
test because the compiler chooses to use memcpy.

For instance we now generate:
test_vneg_s32x4:
	vldrw.32       q3, [r1]
	vneg.s32  q3, q3
	vstrw.32       q3, [r0]
	bx      lr

instead of:
test_vneg_s32x4:
	orr     r3, r1, r0
	lsls    r3, r3, #28
	bne     .L15
	vldrw.32	q3, [r1]
	vneg.s32  q3, q3
	vstrw.32	q3, [r0]
	bx      lr
	.L15:
	push    {r4, r5}
	ldrd    r2, r3, [r1, #8]
	ldrd    r5, r4, [r1]
	rsbs    r2, r2, #0
	rsbs    r5, r5, #0
	rsbs    r4, r4, #0
	rsbs    r3, r3, #0
	strd    r5, r4, [r0]
	pop     {r4, r5}
	strd    r2, r3, [r0, #8]
	bx      lr

2021-01-12  Christophe Lyon  <christophe.lyon@linaro.org>

	PR target/97875
	gcc/
	* config/arm/arm.h (ARM_HAVE_NEON_V8QI_LDST): New macro.
	(ARM_HAVE_NEON_V16QI_LDST, ARM_HAVE_NEON_V4HI_LDST): Likewise.
	(ARM_HAVE_NEON_V8HI_LDST, ARM_HAVE_NEON_V2SI_LDST): Likewise.
	(ARM_HAVE_NEON_V4SI_LDST, ARM_HAVE_NEON_V4HF_LDST): Likewise.
	(ARM_HAVE_NEON_V8HF_LDST, ARM_HAVE_NEON_V4BF_LDST): Likewise.
	(ARM_HAVE_NEON_V8BF_LDST, ARM_HAVE_NEON_V2SF_LDST): Likewise.
	(ARM_HAVE_NEON_V4SF_LDST, ARM_HAVE_NEON_DI_LDST): Likewise.
	(ARM_HAVE_NEON_V2DI_LDST): Likewise.
	(ARM_HAVE_V8QI_LDST, ARM_HAVE_V16QI_LDST): Likewise.
	(ARM_HAVE_V4HI_LDST, ARM_HAVE_V8HI_LDST): Likewise.
	(ARM_HAVE_V2SI_LDST, ARM_HAVE_V4SI_LDST, ARM_HAVE_V4HF_LDST): Likewise.
	(ARM_HAVE_V8HF_LDST, ARM_HAVE_V4BF_LDST, ARM_HAVE_V8BF_LDST): Likewise.
	(ARM_HAVE_V2SF_LDST, ARM_HAVE_V4SF_LDST, ARM_HAVE_DI_LDST): Likewise.
	(ARM_HAVE_V2DI_LDST): Likewise.
	* config/arm/mve.md (*movmisalign<mode>_mve_store): New pattern.
	(*movmisalign<mode>_mve_load): New pattern.
	* config/arm/neon.md (movmisalign<mode>): Move to ...
	* config/arm/vec-common.md: ... here.

	PR target/97875
	gcc/testsuite/
	* gcc.target/arm/simd/mve-vneg.c: Update test.
nstester pushed a commit to nstester/gcc that referenced this pull request Apr 6, 2021
This patch fixes PR99748 which shows us trying to pass the argument to
__aeabi_f2iz in the VFP register s0 when the library function is
expecting to use the GPR r0. It also fixes the __aeabi_f2uiz case which
was broken in the same way.

For the testcase in the PR, here is the code we generate before the
patch (with -mfloat-abi=hard -march=armv8.1-m.main+mve -O0):

main:
    push    {r7, lr}
    sub     sp, sp, #8
    add     r7, sp, #0
    mov     r3, #1065353216
    str     r3, [r7, #4]    @ float
    vldr.32 s0, [r7, #4]
    bl      __aeabi_f2iz
    mov     r3, r0
    cmp     r3, #1
    [...]

This becomes:

main:
    push    {r7, lr}
    sub     sp, sp, #8
    add     r7, sp, #0
    mov     r3, #1065353216
    str     r3, [r7, #4]    @ float
    ldr     r0, [r7, #4]    @ float
    bl      __aeabi_f2iz
    mov     r3, r0
    cmp     r3, #1
    [...]

after the patch. We see a similar change for the same testcase with a
cast to unsigned instead of int.

gcc/ChangeLog:

	PR target/99748
	* config/arm/arm.c (arm_libcall_uses_aapcs_base): Also use base
	PCS for [su]fix_optab.
kraj pushed a commit to kraj/gcc that referenced this pull request Apr 23, 2021
This patch fixes PR99748 which shows us trying to pass the argument to
__aeabi_f2iz in the VFP register s0 when the library function is
expecting to use the GPR r0. It also fixes the __aeabi_f2uiz case which
was broken in the same way.

For the testcase in the PR, here is the code we generate before the
patch (with -mfloat-abi=hard -march=armv8.1-m.main+mve -O0):

main:
    push    {r7, lr}
    sub     sp, sp, #8
    add     r7, sp, #0
    mov     r3, #1065353216
    str     r3, [r7, #4]    @ float
    vldr.32 s0, [r7, #4]
    bl      __aeabi_f2iz
    mov     r3, r0
    cmp     r3, #1
    [...]

This becomes:

main:
    push    {r7, lr}
    sub     sp, sp, #8
    add     r7, sp, #0
    mov     r3, #1065353216
    str     r3, [r7, #4]    @ float
    ldr     r0, [r7, #4]    @ float
    bl      __aeabi_f2iz
    mov     r3, r0
    cmp     r3, #1
    [...]

after the patch. We see a similar change for the same testcase with a
cast to unsigned instead of int.

gcc/ChangeLog:

	PR target/99748
	* config/arm/arm.c (arm_libcall_uses_aapcs_base): Also use base
	PCS for [su]fix_optab.

(cherry picked from commit 16ea7f5)
nstester pushed a commit to nstester/gcc that referenced this pull request Jun 14, 2021
The fixed error is:

==21166==ERROR: AddressSanitizer: alloc-dealloc-mismatch (operator new [] vs operator delete) on 0x60300000d900
    #0 0x7367d7 in operator delete(void*, unsigned long) /home/marxin/BIG/buildbot/buildworker/marxinbox-gcc-asan/build/libsanitizer/asan/asan_new_delete.cpp:172
    #1 0x3b82e6e in pointer_equiv_analyzer::~pointer_equiv_analyzer() /home/marxin/BIG/buildbot/buildworker/marxinbox-gcc-asan/build/gcc/gimple-ssa-evrp.c:161
    #2 0x3b83387 in hybrid_folder::~hybrid_folder() /home/marxin/BIG/buildbot/buildworker/marxinbox-gcc-asan/build/gcc/gimple-ssa-evrp.c:517
    #3 0x3b83387 in execute_early_vrp /home/marxin/BIG/buildbot/buildworker/marxinbox-gcc-asan/build/gcc/gimple-ssa-evrp.c:686
    #4 0x1790611 in execute_one_pass(opt_pass*) /home/marxin/BIG/buildbot/buildworker/marxinbox-gcc-asan/build/gcc/passes.c:2567
    #5 0x1792003 in execute_pass_list_1 /home/marxin/BIG/buildbot/buildworker/marxinbox-gcc-asan/build/gcc/passes.c:2656
    #6 0x1792029 in execute_pass_list_1 /home/marxin/BIG/buildbot/buildworker/marxinbox-gcc-asan/build/gcc/passes.c:2657
    #7 0x179209f in execute_pass_list(function*, opt_pass*) /home/marxin/BIG/buildbot/buildworker/marxinbox-gcc-asan/build/gcc/passes.c:2667
    #8 0x178a5f3 in do_per_function_toporder(void (*)(function*, void*), void*) /home/marxin/BIG/buildbot/buildworker/marxinbox-gcc-asan/build/gcc/passes.c:1773
    #9 0x1792fac in do_per_function_toporder(void (*)(function*, void*), void*) /home/marxin/BIG/buildbot/buildworker/marxinbox-gcc-asan/build/gcc/plugin.h:191
    #10 0x1792fac in execute_ipa_pass_list(opt_pass*) /home/marxin/BIG/buildbot/buildworker/marxinbox-gcc-asan/build/gcc/passes.c:3001
    #11 0xc525fc in ipa_passes /home/marxin/BIG/buildbot/buildworker/marxinbox-gcc-asan/build/gcc/cgraphunit.c:2154
    #12 0xc525fc in symbol_table::compile() /home/marxin/BIG/buildbot/buildworker/marxinbox-gcc-asan/build/gcc/cgraphunit.c:2289
    #13 0xc5a096 in symbol_table::compile() /home/marxin/BIG/buildbot/buildworker/marxinbox-gcc-asan/build/gcc/cgraphunit.c:2269
    #14 0xc5a096 in symbol_table::finalize_compilation_unit() /home/marxin/BIG/buildbot/buildworker/marxinbox-gcc-asan/build/gcc/cgraphunit.c:2537
    #15 0x1a7a17c in compile_file /home/marxin/BIG/buildbot/buildworker/marxinbox-gcc-asan/build/gcc/toplev.c:482
    #16 0x69c758 in do_compile /home/marxin/BIG/buildbot/buildworker/marxinbox-gcc-asan/build/gcc/toplev.c:2210
    #17 0x69c758 in toplev::main(int, char**) /home/marxin/BIG/buildbot/buildworker/marxinbox-gcc-asan/build/gcc/toplev.c:2349
    #18 0x6a932a in main /home/marxin/BIG/buildbot/buildworker/marxinbox-gcc-asan/build/gcc/main.c:39
    #19 0x7ffff7820b34 in __libc_start_main ../csu/libc-start.c:332
    #20 0x6aa5fd in _start (/home/marxin/BIG/buildbot/buildworker/marxinbox-gcc-asan/objdir/gcc/cc1+0x6aa5fd)

0x60300000d900 is located 0 bytes inside of 32-byte region [0x60300000d900,0x60300000d920)
allocated by thread T0 here:
    #0 0x735ab7 in operator new[](unsigned long) /home/marxin/BIG/buildbot/buildworker/marxinbox-gcc-asan/build/libsanitizer/asan/asan_new_delete.cpp:102
    #1 0x3b82dac in pointer_equiv_analyzer::pointer_equiv_analyzer(gimple_ranger*) /home/marxin/BIG/buildbot/buildworker/marxinbox-gcc-asan/build/gcc/gimple-ssa-evrp.c:156

gcc/ChangeLog:

	* gimple-ssa-evrp.c (pointer_equiv_analyzer::~pointer_equiv_analyzer): Use delete[].
nstester pushed a commit to nstester/gcc that referenced this pull request Jun 14, 2021
This patch adds vec_unpack<US>_hi_<mode>, vec_unpack<US>_lo_<mode>,
vec_pack_trunc_<mode> patterns for MVE.

It does so by moving the unpack patterns from neon.md to
vec-common.md, while adding them support for MVE. The pack expander is
derived from the Neon one (which in turn is renamed into
neon_quad_vec_pack_trunc_<mode>).

The patch introduces mve_vec_unpack<US>_lo_<mode> and
mve_vec_unpack<US>_hi_<mode> which are similar to their Neon
counterparts, except for the assembly syntax.

The patch introduces mve_vec_pack_trunc_lo_<mode> to avoid the need for a
zero-initialized temporary, which is needed if the
vec_pack_trunc_<mode> expander calls @mve_vmovn[bt]q_<supf><mode>
instead.

With this patch, we can now vectorize the 16 and 8-bit versions of
vclz and vshl, although the generated code could still be improved.
For test_clz_s16, we now generate
        vldrh.16        q3, [r1]
        vmovlb.s16   q2, q3
        vmovlt.s16   q3, q3
        vclz.i32  q2, q2
        vclz.i32  q3, q3
        vmovnb.i32      q1, q2
        vmovnt.i32      q1, q3
        vstrh.16        q1, [r0]
which could be improved to
        vldrh.16        q3, [r1]
	vclz.i16	q1, q3
        vstrh.16        q1, [r0]
if we could avoid the need for unpack/pack steps.

For reference, clang-12 generates:
	vldrh.s32       q0, [r1]
	vldrh.s32       q1, [r1, #8]
	vclz.i32        q0, q0
	vstrh.32        q0, [r0]
	vclz.i32        q0, q1
	vstrh.32        q0, [r0, #8]

2021-06-11  Christophe Lyon  <christophe.lyon@linaro.org>

	gcc/
	* config/arm/mve.md (mve_vec_unpack<US>_lo_<mode>): New pattern.
	(mve_vec_unpack<US>_hi_<mode>): New pattern.
	(@mve_vec_pack_trunc_lo_<mode>): New pattern.
	(mve_vmovntq_<supf><mode>): Prefix with '@'.
	* config/arm/neon.md (vec_unpack<US>_hi_<mode>): Move to
	vec-common.md.
	(vec_unpack<US>_lo_<mode>): Likewise.
	(vec_pack_trunc_<mode>): Rename to
	neon_quad_vec_pack_trunc_<mode>.
	* config/arm/vec-common.md (vec_unpack<US>_hi_<mode>): New
	pattern.
	(vec_unpack<US>_lo_<mode>): New.
	(vec_pack_trunc_<mode>): New.

	gcc/testsuite/
	* gcc.target/arm/simd/mve-vclz.c: Update expected results.
	* gcc.target/arm/simd/mve-vshl.c: Likewise.
	* gcc.target/arm/simd/mve-vec-pack.c: New test.
	* gcc.target/arm/simd/mve-vec-unpack.c: New test.
nstester pushed a commit to nstester/gcc that referenced this pull request Sep 13, 2021
The current restriction on folding memcpy to a single element of size
MOVE_MAX is excessively cautious on most machines and limits some
significant further optimizations.  So relax the restriction provided
the copy size does not exceed MOVE_MAX * MOVE_RATIO and that a SET
insn exists for moving the value into machine registers.

Note that there were already checks in place for having misaligned
move operations when one or more of the operands were unaligned.

On Arm this now permits optimizing

uint64_t bar64(const uint8_t *rData1)
{
    uint64_t buffer;
    memcpy(&buffer, rData1, sizeof(buffer));
    return buffer;
}

from
        ldr     r2, [r0]        @ unaligned
        sub     sp, sp, #8
        ldr     r3, [r0, #4]    @ unaligned
        strd    r2, [sp]
        ldrd    r0, [sp]
        add     sp, sp, #8

to
        mov     r3, r0
        ldr     r0, [r0]        @ unaligned
        ldr     r1, [r3, #4]    @ unaligned

PR target/102125 - (ARM Cortex-M3 and newer) missed optimization. memcpy not needed operations

gcc/ChangeLog:

	PR target/102125
	* gimple-fold.c (gimple_fold_builtin_memory_op): Allow folding
	memcpy if the size is not more than MOVE_MAX * MOVE_RATIO.
nstester pushed a commit to nstester/gcc that referenced this pull request Nov 26, 2021
Fixes:

==129444==ERROR: AddressSanitizer: global-buffer-overflow on address 0x00000666ca5c at pc 0x000000ef094b bp 0x7fffffff8180 sp 0x7fffffff8178
READ of size 4 at 0x00000666ca5c thread T0
    #0 0xef094a in parse_optimize_options ../../gcc/d/d-attribs.cc:855
    #1 0xef0d36 in d_handle_optimize_attribute ../../gcc/d/d-attribs.cc:916
    #2 0xef107e in d_handle_optimize_attribute ../../gcc/d/d-attribs.cc:887
    #3 0xff85b1 in decl_attributes(tree_node**, tree_node*, int, tree_node*) ../../gcc/attribs.c:829
    #4 0xef2a91 in apply_user_attributes(Dsymbol*, tree_node*) ../../gcc/d/d-attribs.cc:427
    #5 0xf7b7f3 in get_symbol_decl(Declaration*) ../../gcc/d/decl.cc:1346
    #6 0xf87bc7 in get_symbol_decl(Declaration*) ../../gcc/d/decl.cc:967
    #7 0xf87bc7 in DeclVisitor::visit(FuncDeclaration*) ../../gcc/d/decl.cc:808
    #8 0xf83db5 in DeclVisitor::build_dsymbol(Dsymbol*) ../../gcc/d/decl.cc:146

for the following test-case: gcc/testsuite/gdc.dg/attr_optimize1.d.

gcc/d/ChangeLog:

	* d-attribs.cc (parse_optimize_options): Check index before
	accessing cl_options.
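
A tiny generic C++ sketch of this defect class and the shape of the fix (the table and names below are hypothetical, not GCC's real cl_options): an index obtained elsewhere must be range-checked before it is used to subscript a fixed-size array.

// Hypothetical illustration of bounds-checking an index before table access.
#include <cstdio>

struct option_desc { const char *name; };

// Assumed stand-in for the real options table.
static const option_desc cl_options_sketch[] = { {"-O0"}, {"-O1"}, {"-O2"} };
static const unsigned n_opts
  = sizeof (cl_options_sketch) / sizeof (cl_options_sketch[0]);

static const char *
lookup (unsigned idx)
{
  if (idx >= n_opts)            // the added guard: reject out-of-range indices
    return "(unknown option)";
  return cl_options_sketch[idx].name;
}

int main ()
{
  std::printf ("%s\n", lookup (2));   // in range
  std::printf ("%s\n", lookup (7));   // out of range; previously an overflowing read
  return 0;
}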
kraj pushed a commit to kraj/gcc that referenced this pull request Nov 26, 2021
Fixes:

==129444==ERROR: AddressSanitizer: global-buffer-overflow on address 0x00000666ca5c at pc 0x000000ef094b bp 0x7fffffff8180 sp 0x7fffffff8178
READ of size 4 at 0x00000666ca5c thread T0
    #0 0xef094a in parse_optimize_options ../../gcc/d/d-attribs.cc:855
    #1 0xef0d36 in d_handle_optimize_attribute ../../gcc/d/d-attribs.cc:916
    #2 0xef107e in d_handle_optimize_attribute ../../gcc/d/d-attribs.cc:887
    #3 0xff85b1 in decl_attributes(tree_node**, tree_node*, int, tree_node*) ../../gcc/attribs.c:829
    #4 0xef2a91 in apply_user_attributes(Dsymbol*, tree_node*) ../../gcc/d/d-attribs.cc:427
    #5 0xf7b7f3 in get_symbol_decl(Declaration*) ../../gcc/d/decl.cc:1346
    #6 0xf87bc7 in get_symbol_decl(Declaration*) ../../gcc/d/decl.cc:967
    #7 0xf87bc7 in DeclVisitor::visit(FuncDeclaration*) ../../gcc/d/decl.cc:808
    #8 0xf83db5 in DeclVisitor::build_dsymbol(Dsymbol*) ../../gcc/d/decl.cc:146

for the following test-case: gcc/testsuite/gdc.dg/attr_optimize1.d.

gcc/d/ChangeLog:

	* d-attribs.cc (parse_optimize_options): Check index before
	accessing cl_options.
nstester pushed a commit to nstester/gcc that referenced this pull request Dec 30, 2021
…imize or target pragmas [PR103012]

The following testcases ICE when an optimize or target pragma
is followed by a long line (4096+ chars).
This is because on such long lines we can't use columns anymore,
but the cpp_define calls performed by c_cpp_builtins_optimize_pragma
or from the backend hooks for target pragma are done on temporary
buffers and expect to get columns from whatever line they appear on
(which happens to be the long line after optimize/target pragma),
and we run into:
 #0  fancy_abort (file=0x3abec67 "../../libcpp/line-map.c", line=502, function=0x3abecfc "linemap_add") at ../../gcc/diagnostic.c:1986
 #1  0x0000000002e7c335 in linemap_add (set=0x7ffff7fca000, reason=LC_RENAME, sysp=0, to_file=0x41287a0 "pr103012.i", to_line=3) at ../../libcpp/line-map.c:502
 #2  0x0000000002e7cc24 in linemap_line_start (set=0x7ffff7fca000, to_line=3, max_column_hint=128) at ../../libcpp/line-map.c:827
 #3  0x0000000002e7ce2b in linemap_position_for_column (set=0x7ffff7fca000, to_column=1) at ../../libcpp/line-map.c:898
 #4  0x0000000002e771f9 in _cpp_lex_direct (pfile=0x40c3b60) at ../../libcpp/lex.c:3592
 #5  0x0000000002e76c3e in _cpp_lex_token (pfile=0x40c3b60) at ../../libcpp/lex.c:3394
 #6  0x0000000002e610ef in lex_macro_node (pfile=0x40c3b60, is_def_or_undef=true) at ../../libcpp/directives.c:601
 #7  0x0000000002e61226 in do_define (pfile=0x40c3b60) at ../../libcpp/directives.c:639
 #8  0x0000000002e610b2 in run_directive (pfile=0x40c3b60, dir_no=0, buf=0x7fffffffd430 "__OPTIMIZE__ 1\n", count=14) at ../../libcpp/directives.c:589
 #9  0x0000000002e650c1 in cpp_define (pfile=0x40c3b60, str=0x2f784d1 "__OPTIMIZE__") at ../../libcpp/directives.c:2513
 #10 0x0000000002e65100 in cpp_define_unused (pfile=0x40c3b60, str=0x2f784d1 "__OPTIMIZE__") at ../../libcpp/directives.c:2522
 #11 0x0000000000f50685 in c_cpp_builtins_optimize_pragma (pfile=0x40c3b60, prev_tree=<optimization_node 0x7fffea042000>, cur_tree=<optimization_node 0x7fffea042020>)
     at ../../gcc/c-family/c-cppbuiltin.c:600
assertion that LC_RENAME doesn't happen first.

I think the right fix is emit those predefined macros upon
optimize/target pragmas with BUILTINS_LOCATION, like we already do
for those macros at the start of the TU, they don't appear in columns
of the next line after it.  Another possibility would be to force them
at the location of the pragma.

2021-12-30  Jakub Jelinek  <jakub@redhat.com>

	PR c++/103012
gcc/
	* config/i386/i386-c.c (ix86_pragma_target_parse): Perform
	cpp_define/cpp_undef calls with forced token locations
	BUILTINS_LOCATION.
	* config/arm/arm-c.c (arm_pragma_target_parse): Likewise.
	* config/aarch64/aarch64-c.c (aarch64_pragma_target_parse): Likewise.
	* config/s390/s390-c.c (s390_pragma_target_parse): Likewise.
gcc/c-family/
	* c-cppbuiltin.c (c_cpp_builtins_optimize_pragma): Perform
	cpp_define_unused/cpp_undef calls with forced token locations
	BUILTINS_LOCATION.
gcc/testsuite/
	PR c++/103012
	* g++.dg/cpp/pr103012.C: New test.
	* g++.target/i386/pr103012.C: New test.
kraj pushed a commit to kraj/gcc that referenced this pull request Jan 24, 2022
…imize or target pragmas [PR103012]

The following testcases ICE when an optimize or target pragma
is followed by a long line (4096+ chars).
This is because on such long lines we can't use columns anymore,
but the cpp_define calls performed by c_cpp_builtins_optimize_pragma
or from the backend hooks for target pragma are done on temporary
buffers and expect to get columns from whatever line they appear on
(which happens to be the long line after optimize/target pragma),
and we run into:
 #0  fancy_abort (file=0x3abec67 "../../libcpp/line-map.c", line=502, function=0x3abecfc "linemap_add") at ../../gcc/diagnostic.c:1986
 #1  0x0000000002e7c335 in linemap_add (set=0x7ffff7fca000, reason=LC_RENAME, sysp=0, to_file=0x41287a0 "pr103012.i", to_line=3) at ../../libcpp/line-map.c:502
 #2  0x0000000002e7cc24 in linemap_line_start (set=0x7ffff7fca000, to_line=3, max_column_hint=128) at ../../libcpp/line-map.c:827
 #3  0x0000000002e7ce2b in linemap_position_for_column (set=0x7ffff7fca000, to_column=1) at ../../libcpp/line-map.c:898
 #4  0x0000000002e771f9 in _cpp_lex_direct (pfile=0x40c3b60) at ../../libcpp/lex.c:3592
 #5  0x0000000002e76c3e in _cpp_lex_token (pfile=0x40c3b60) at ../../libcpp/lex.c:3394
 #6  0x0000000002e610ef in lex_macro_node (pfile=0x40c3b60, is_def_or_undef=true) at ../../libcpp/directives.c:601
 #7  0x0000000002e61226 in do_define (pfile=0x40c3b60) at ../../libcpp/directives.c:639
 #8  0x0000000002e610b2 in run_directive (pfile=0x40c3b60, dir_no=0, buf=0x7fffffffd430 "__OPTIMIZE__ 1\n", count=14) at ../../libcpp/directives.c:589
 #9  0x0000000002e650c1 in cpp_define (pfile=0x40c3b60, str=0x2f784d1 "__OPTIMIZE__") at ../../libcpp/directives.c:2513
 #10 0x0000000002e65100 in cpp_define_unused (pfile=0x40c3b60, str=0x2f784d1 "__OPTIMIZE__") at ../../libcpp/directives.c:2522
 #11 0x0000000000f50685 in c_cpp_builtins_optimize_pragma (pfile=0x40c3b60, prev_tree=<optimization_node 0x7fffea042000>, cur_tree=<optimization_node 0x7fffea042020>)
     at ../../gcc/c-family/c-cppbuiltin.c:600
assertion that LC_RENAME doesn't happen first.

I think the right fix is emit those predefined macros upon
optimize/target pragmas with BUILTINS_LOCATION, like we already do
for those macros at the start of the TU, they don't appear in columns
of the next line after it.  Another possibility would be to force them
at the location of the pragma.

2021-12-30  Jakub Jelinek  <jakub@redhat.com>

	PR c++/103012
gcc/
	* config/i386/i386-c.c (ix86_pragma_target_parse): Perform
	cpp_define/cpp_undef calls with forced token locations
	BUILTINS_LOCATION.
	* config/arm/arm-c.c (arm_pragma_target_parse): Likewise.
	* config/aarch64/aarch64-c.c (aarch64_pragma_target_parse): Likewise.
	* config/s390/s390-c.c (s390_pragma_target_parse): Likewise.
gcc/c-family/
	* c-cppbuiltin.c (c_cpp_builtins_optimize_pragma): Perform
	cpp_define_unused/cpp_undef calls with forced token locations
	BUILTINS_LOCATION.
gcc/testsuite/
	PR c++/103012
	* g++.dg/cpp/pr103012.C: New test.
	* g++.target/i386/pr103012.C: New test.

(cherry picked from commit 1dbe26b)
vathpela pushed a commit to vathpela/gcc that referenced this pull request Feb 22, 2022
…04617]

On
 #define A(n) int foo1##n(void) { return 1##n; }
 #define B(n) A(n##0) A(n##1) A(n##2) A(n##3) A(n##4) A(n##5) A(n##6) A(n##7) A(n##8) A(n##9)
 #define C(n) B(n##0) B(n##1) B(n##2) B(n##3) B(n##4) B(n##5) B(n##6) B(n##7) B(n##8) B(n##9)
 #define D(n) C(n##0) C(n##1) C(n##2) C(n##3) C(n##4) C(n##5) C(n##6) C(n##7) C(n##8) C(n##9)
 #define E(n) D(n##0) D(n##1) D(n##2) D(n##3) D(n##4) D(n##5) D(n##6) D(n##7) D(n##8) D(n##9)
 E(0) E(1) E(2) D(30) D(31) C(320) C(321) C(322) C(323) C(324) C(325)
 B(3260) B(3261) B(3262) B(3263) A(32640) A(32641) A(32642)
testcase with
./xgcc -B ./ -c -g -fpic -ffat-lto-objects -flto  -O0 -o foo1.o foo1.c -ffunction-sections
./xgcc -B ./ -shared -g -fpic -flto -O0 -o foo1.so foo1.o
/tmp/ccTW8mBm.debug.temp.o: file not recognized: file format not recognized
(testcase too slow to be included into testsuite).
The problem is clearly reported by readelf:
readelf: foo1.o.debug.temp.o: Warning: Section 2 has an out of range sh_link value of 65321
readelf: foo1.o.debug.temp.o: Warning: Section 5 has an out of range sh_link value of 65321
readelf: foo1.o.debug.temp.o: Warning: Section 10 has an out of range sh_link value of 65323
readelf: foo1.o.debug.temp.o: Warning: [ 2]: Link field (65321) should index a symtab section.
readelf: foo1.o.debug.temp.o: Warning: [ 5]: Link field (65321) should index a symtab section.
readelf: foo1.o.debug.temp.o: Warning: [10]: Link field (65323) should index a string section.
because simple_object_elf_copy_lto_debug_sections doesn't adjust sh_info and
sh_link fields in ElfNN_Shdr if they are in between SHN_{LO,HI}RESERVE
inclusive.  Not adjusting those is incorrect though: the SHN_{LO,HI}RESERVE
range is only relevant to the 16-bit fields, mainly st_shndx in ElfNN_Sym,
where if one needs a section number >= SHN_LORESERVE, SHN_XINDEX should be
used instead and the .symtab_shndx section should contain the real section
index, and the ElfNN_Ehdr e_shnum and e_shstrndx fields, where if a value
>= SHN_LORESERVE is needed it should be put into Shdr[0].sh_{size,link}.
But sh_{link,info} are 32-bit fields which can contain any section index.
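
To make the 16-bit vs. 32-bit distinction concrete, here is a minimal
sketch using the standard elf.h definitions (the helper name and
parameters are illustrative, not libiberty code):

#include <elf.h>
#include <stddef.h>

/* st_shndx is only 16 bits, so a real section index >= SHN_LORESERVE
   must be escaped via SHN_XINDEX and stored in the .symtab_shndx table.
   32-bit fields like sh_link/sh_info need no such escape.  */
static unsigned int
sym_section_index (const Elf64_Sym *sym, const Elf32_Word *shndx_table,
                   size_t symidx)
{
  if (sym->st_shndx == SHN_XINDEX)
    return shndx_table[symidx];   /* real index stored out of line */
  return sym->st_shndx;           /* ordinary or special (SHN_ABS etc.) */
}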

Note, as simple-object-elf.c mentions, binutils from 2.12 to 2.18 (so before
2011) used to mishandle the > 63.75K sections case and assumed there is a
hole in between the sections; but what
simple_object_elf_copy_lto_debug_sections does wouldn't help in that case
for the debug temp object creation, and we'd need to detect the case also
in that routine and take it into account in the remapping etc.  I think it
is not worth it given that those binutils are over 10 years old; if
somebody needs 63.75K or more sections, they had better use more recent
binutils.

2022-02-22  Jakub Jelinek  <jakub@redhat.com>

	PR lto/104617
	* simple-object-elf.c (simple_object_elf_match): Fix up URL
	in comment.
	(simple_object_elf_copy_lto_debug_sections): Remap sh_info and
	sh_link even if they are in the SHN_LORESERVE .. SHN_HIRESERVE
	range (inclusive).
kraj pushed a commit to kraj/gcc that referenced this pull request May 10, 2022
kraj pushed a commit to kraj/gcc that referenced this pull request May 11, 2022
xionghul pushed a commit to xionghul/gcc that referenced this pull request Jul 18, 2022
This patch extends the fix for PR106253 to AArch32.  As with AArch64,
we were using ACLE intrinsics to vectorise scalar built-ins, even
though the two sometimes have different ECF_* flags.  (That in turn
is because the ACLE intrinsics should follow the instruction semantics
as closely as possible, whereas the scalar built-ins follow language
specs.)

The patch also removes the copysignf built-in, which only existed
for this purpose and wasn't a “real” arm_neon.h built-in.

Doing this also has the side-effect of enabling vectorisation of
rint and roundeven.  Logically that should be a separate patch,
but making it one would have meant adding a new int iterator
for the original set of instructions and then removing it again
when including new functions.
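
As a usage sketch (my example, not from the patch), a loop of the
following shape can now be vectorised on AArch32, with each rintf
mapping onto a NEON vrint* instruction per vector lane:

#include <math.h>

void round_all (float *x, int n)
{
  for (int i = 0; i < n; i++)
    x[i] = rintf (x[i]);
}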

I've restricted the bswap tests to little-endian because we end
up with excessive spilling on big-endian.  E.g.:

        sub     sp, sp, #8
        vstr    d1, [sp]
        vldr    d16, [sp]
        vrev16.8        d16, d16
        vstr    d16, [sp]
        vldr    d0, [sp]
        add     sp, sp, #8
        @ sp needed
        bx      lr

Similarly, the copysign tests require little-endian because on
big-endian we unnecessarily load the constant from the constant pool:

        vldr.32 s15, .L3
        vdup.32 d0, d7[1]
        vbsl    d0, d2, d1
        bx      lr
.L3:
        .word   -2147483648

gcc/
	PR target/106253
	* config/arm/arm-builtins.cc (arm_builtin_vectorized_function):
	Delete.
	* config/arm/arm-protos.h (arm_builtin_vectorized_function): Delete.
	* config/arm/arm.cc (TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION):
	Delete.
	* config/arm/arm_neon_builtins.def (copysignf): Delete.
	* config/arm/iterators.md (nvrint_pattern): New attribute.
	* config/arm/neon.md (<NEON_VRINT:nvrint_pattern><VCVTF:mode>2):
	New pattern.
	(l<NEON_VCVT:nvrint_pattern><su_optab><VCVTF:mode><v_cmp_result>2):
	Likewise.
	(neon_copysignf<mode>): Rename to...
	(copysign<mode>3): ...this.

gcc/testsuite/
	PR target/106253
	* gcc.target/arm/vect_unary_1.c: New test.
	* gcc.target/arm/vect_binary_1.c: Likewise.
xionghul pushed a commit to xionghul/gcc that referenced this pull request Aug 30, 2022
Currently SLP tries to force permute operations "down" the graph
from loads in the hope of reducing the total number of permutations
needed or (in the best case) removing the need for the permutations
entirely.  This patch tries to extend it as follows:

- Allow loads to take a different permutation from the one they
  started with, rather than choosing between "original permutation"
  and "no permutation".

- Allow changes in both directions, if the target supports the
  reverse permutation.

- Treat the placement of permutations as a two-way dataflow problem:
  after propagating information from leaves to roots (as now), propagate
  information back up the graph.

- Take execution frequency into account when optimising for speed,
  so that (for example) permutations inside loops have a higher
  cost than permutations outside loops.

- Try to reduce the total number of permutations when optimising for
  size, even if that increases the number of permutations on a given
  execution path.

See the big block comment above vect_optimize_slp_pass for
a detailed description.

The original motivation for doing this was to add a framework that would
allow other layout differences in future.  The two main ones are:

- Make it easier to represent predicated operations, including
  predicated operations with gaps.  E.g.:

     a[0] += 1;
     a[1] += 1;
     a[3] += 1;

  could be a single load/add/store for SVE.  We could handle this
  by representing a layout such as { 0, 1, _, 2 } or { 0, 1, _, 3 }
  (depending on what's being counted).  We might need to move
  elements between lanes at various points, like with permutes.

  (This would first mean adding support for stores with gaps.)

- Make it easier to switch between an even/odd and unpermuted layout
  when switching between wide and narrow elements.  E.g. if a widening
  operation produces an even vector and an odd vector, we should try
  to keep operations on the wide elements in that order rather than
  force them to be permuted back "in order".

To give some examples of what the patch does:

int f1(int *__restrict a, int *__restrict b, int *__restrict c,
       int *__restrict d)
{
  a[0] = (b[1] << c[3]) - d[1];
  a[1] = (b[0] << c[2]) - d[0];
  a[2] = (b[3] << c[1]) - d[3];
  a[3] = (b[2] << c[0]) - d[2];
}

continues to produce the same code as before when optimising for
speed: b, c and d are permuted at load time.  But when optimising
for size we instead permute c into the same order as b+d and then
permute the result of the arithmetic into the same order as a:

        ldr     q1, [x2]
        ldr     q0, [x1]
        ext     v1.16b, v1.16b, v1.16b, #8     // <------
        sshl    v0.4s, v0.4s, v1.4s
        ldr     q1, [x3]
        sub     v0.4s, v0.4s, v1.4s
        rev64   v0.4s, v0.4s                   // <------
        str     q0, [x0]
        ret

The following function:

int f2(int *__restrict a, int *__restrict b, int *__restrict c,
       int *__restrict d)
{
  a[0] = (b[3] << c[3]) - d[3];
  a[1] = (b[2] << c[2]) - d[2];
  a[2] = (b[1] << c[1]) - d[1];
  a[3] = (b[0] << c[0]) - d[0];
}

continues to push the reverse down to just before the store,
like the previous code did.

In:

int f3(int *__restrict a, int *__restrict b, int *__restrict c,
       int *__restrict d)
{
  for (int i = 0; i < 100; ++i)
    {
      a[0] = (a[0] + c[3]);
      a[1] = (a[1] + c[2]);
      a[2] = (a[2] + c[1]);
      a[3] = (a[3] + c[0]);
      c += 4;
    }
}

the loads of a are hoisted and the stores of a are sunk, so that
only the load from c happens in the loop.  When optimising for
speed, we prefer to have the loop operate on the reversed layout,
changing on entry and exit from the loop:

        mov     x3, x0
        adrp    x0, .LC0
        add     x1, x2, 1600
        ldr     q2, [x0, #:lo12:.LC0]
        ldr     q0, [x3]
        mov     v1.16b, v0.16b
        tbl     v0.16b, {v0.16b - v1.16b}, v2.16b    // <--------
        .p2align 3,,7
.L6:
        ldr     q1, [x2], 16
        add     v0.4s, v0.4s, v1.4s
        cmp     x2, x1
        bne     .L6
        mov     v1.16b, v0.16b
        adrp    x0, .LC0
        ldr     q2, [x0, #:lo12:.LC0]
        tbl     v0.16b, {v0.16b - v1.16b}, v2.16b    // <--------
        str     q0, [x3]
        ret

Similarly, for the very artificial testcase:

int f4(int *__restrict a, int *__restrict b, int *__restrict c,
       int *__restrict d)
{
  int a0 = a[0];
  int a1 = a[1];
  int a2 = a[2];
  int a3 = a[3];
  for (int i = 0; i < 100; ++i)
    {
      a0 ^= c[0];
      a1 ^= c[1];
      a2 ^= c[2];
      a3 ^= c[3];
      c += 4;
      for (int j = 0; j < 100; ++j)
	{
	  a0 += d[1];
	  a1 += d[0];
	  a2 += d[3];
	  a3 += d[2];
	  d += 4;
	}
      b[0] = a0;
      b[1] = a1;
      b[2] = a2;
      b[3] = a3;
      b += 4;
    }
  a[0] = a0;
  a[1] = a1;
  a[2] = a2;
  a[3] = a3;
}

the a vector in the inner loop maintains the order { 1, 0, 3, 2 },
even though it's part of an SCC that includes the outer loop.
In other words, this is a motivating case for not assigning
permutes at SCC granularity.  The code we get is:

        ldr     q0, [x0]
        mov     x4, x1
        mov     x5, x0
        add     x1, x3, 1600
        add     x3, x4, 1600
        .p2align 3,,7
.L11:
        ldr     q1, [x2], 16
        sub     x0, x1, #1600
        eor     v0.16b, v1.16b, v0.16b
        rev64   v0.4s, v0.4s              // <---
        .p2align 3,,7
.L10:
        ldr     q1, [x0], 16
        add     v0.4s, v0.4s, v1.4s
        cmp     x0, x1
        bne     .L10
        rev64   v0.4s, v0.4s              // <---
        add     x1, x0, 1600
        str     q0, [x4], 16
        cmp     x3, x4
        bne     .L11
        str     q0, [x5]
        ret

bb-slp-layout-17.c is a collection of compile tests for problems
I hit with earlier versions of the patch.  The same problems might
show up elsewhere, but it seemed worth having the test anyway.

In slp-11b.c we previously pushed the permutation of the in[i*4]
group down from the load to just before the store.  That didn't
reduce the number or frequency of the permutations (or increase
them either).  But separating the permute from the load meant
that we could no longer use load/store lanes.

Whether load/store lanes are a good idea here is another question.
If there were two sets of loads, and if we could use a single
permutation instead of one per load, then avoiding load/store
lanes should be a good thing even under the current abstract
cost model.  But I think under the current model we should
try to avoid splitting up potential load/store lanes groups
if there is no specific benefit to the split.

Preferring load/store lanes is still a source of missed optimisations
that we should fix one day...

gcc/
	* params.opt (-param=vect-max-layout-candidates=): New parameter.
	* doc/invoke.texi (vect-max-layout-candidates): Document it.
	* tree-vectorizer.h (auto_lane_permutation_t): New typedef.
	(auto_load_permutation_t): Likewise.
	* tree-vect-slp.cc (vect_slp_node_weight): New function.
	(slpg_layout_cost): New class.
	(slpg_vertex): Replace perm_in and perm_out with partition,
	out_degree, weight and out_weight.
	(slpg_partition_info, slpg_partition_layout_costs): New classes.
	(vect_optimize_slp_pass): Likewise, cannibalizing some part of
	the previous vect_optimize_slp.
	(vect_optimize_slp): Use it.

gcc/testsuite/
	* lib/target-supports.exp (check_effective_target_vect_var_shift):
	Return true for aarch64.
	* gcc.dg/vect/bb-slp-layout-1.c: New test.
	* gcc.dg/vect/bb-slp-layout-2.c: New test.
	* gcc.dg/vect/bb-slp-layout-3.c: New test.
	* gcc.dg/vect/bb-slp-layout-4.c: New test.
	* gcc.dg/vect/bb-slp-layout-5.c: New test.
	* gcc.dg/vect/bb-slp-layout-6.c: New test.
	* gcc.dg/vect/bb-slp-layout-7.c: New test.
	* gcc.dg/vect/bb-slp-layout-8.c: New test.
	* gcc.dg/vect/bb-slp-layout-9.c: New test.
	* gcc.dg/vect/bb-slp-layout-10.c: New test.
	* gcc.dg/vect/bb-slp-layout-11.c: New test.
	* gcc.dg/vect/bb-slp-layout-13.c: New test.
	* gcc.dg/vect/bb-slp-layout-14.c: New test.
	* gcc.dg/vect/bb-slp-layout-15.c: New test.
	* gcc.dg/vect/bb-slp-layout-16.c: New test.
	* gcc.dg/vect/bb-slp-layout-17.c: New test.
	* gcc.dg/vect/slp-11b.c: XFAIL SLP test for load-lanes targets.
xionghul pushed a commit to xionghul/gcc that referenced this pull request Jan 28, 2023
The aarch64 ISA specification allows a left shift amount to be applied
after extension in the range of 0 to 4 (encoded in the imm3 field).

This is true for at least the following instructions:

 * ADD (extend register)
 * ADDS (extended register)
 * SUB (extended register)

The result of this patch can be seen when compiling the following code:

uint64_t myadd(uint64_t a, uint64_t b)
{
    return a+(((uint8_t)b)<<4);
}

Without the patch the following sequence will be generated:

0000000000000000 <myadd>:
   0:	d37c1c21 	ubfiz	x1, x1, #4, #8
   4:	8b000020 	add	x0, x1, x0
   8:	d65f03c0 	ret

With the patch the ubfiz will be merged into the add instruction:

0000000000000000 <myadd>:
   0:	8b211000 	add	x0, x0, w1, uxtb #4
   4:	d65f03c0 	ret

gcc/ChangeLog:

	* config/aarch64/aarch64.cc (aarch64_uxt_size): fix an
	off-by-one in checking the permissible shift-amount.
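
The fix is a one-token change to the range check; a hedged sketch of the
corrected guard (the helper is illustrative and the old bound is an
assumption; the real code sits inside aarch64_uxt_size):

/* The imm3 field encodes a post-extension left shift of 0..4, so the
   guard must accept shift == 4 as well.  */
static int
uxt_shift_ok (int shift)
{
  return shift >= 0 && shift <= 4;   /* the old check stopped one short */
}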
xionghul pushed a commit to xionghul/gcc that referenced this pull request Mar 12, 2023
…hook [PR108583]

This replaces the custom division hook with just an implementation through
add_highpart.  For NEON we implement the add highpart (Addition + extraction of
the upper highpart of the register in the same precision) as ADD + LSR.

This representation allows us to easily optimize the sequence using existing
sequences. This gets us a pretty decent sequence using SRA:

        umull   v1.8h, v0.8b, v3.8b
        umull2  v0.8h, v0.16b, v3.16b
        add     v5.8h, v1.8h, v2.8h
        add     v4.8h, v0.8h, v2.8h
        usra    v1.8h, v5.8h, 8
        usra    v0.8h, v4.8h, 8
        uzp2    v1.16b, v1.16b, v0.16b
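
The add-highpart building block itself is just the following scalar
idiom, shown as a hedged C sketch (the helper name is mine; the patch
expresses this in RTL as ADD + LSR):

#include <stdint.h>

/* Add two values and keep the upper half of the sum in the same
   precision: for 16-bit elements this is the ADD + LSR #8 pair.  */
static inline uint16_t
add_highpart_u16 (uint16_t a, uint16_t b)
{
  return (uint16_t) ((a + b) >> 8);   /* upper bits, carry included */
}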

To get the most optimal sequence, however, we match (a + ((b + c) >> n)),
where n is half the precision of the mode of the operation, into
addhn + uaddw, which is a generally good optimization on its own and gets
us back to:

.L4:
        ldr     q0, [x3]
        umull   v1.8h, v0.8b, v5.8b
        umull2  v0.8h, v0.16b, v5.16b
        addhn   v3.8b, v1.8h, v4.8h
        addhn   v2.8b, v0.8h, v4.8h
        uaddw   v1.8h, v1.8h, v3.8b
        uaddw   v0.8h, v0.8h, v2.8b
        uzp2    v1.16b, v1.16b, v0.16b
        str     q1, [x3], 16
        cmp     x3, x4
        bne     .L4

For SVE2 we optimize the initial sequence to the same ADD + LSR which gets us:

.L3:
        ld1b    z0.h, p0/z, [x0, x3]
        mul     z0.h, p1/m, z0.h, z2.h
        add     z1.h, z0.h, z3.h
        usra    z0.h, z1.h, #8
        lsr     z0.h, z0.h, #8
        st1b    z0.h, p0, [x0, x3]
        inch    x3
        whilelo p0.h, w3, w2
        b.any   .L3
.L1:
        ret

and to get the most optimal sequence I match (a + b) >> n (same constraint on n)
to addhnb which gets us to:

.L3:
        ld1b    z0.h, p0/z, [x0, x3]
        mul     z0.h, p1/m, z0.h, z2.h
        addhnb  z1.b, z0.h, z3.h
        addhnb  z0.b, z0.h, z1.h
        st1b    z0.h, p0, [x0, x3]
        inch    x3
        whilelo p0.h, w3, w2
        b.any   .L3

There are multiple possible RTL representations for these optimizations;
I did not represent them using a zero_extend because we seem very
inconsistent about this in the backend.  Since they are unspecs we won't
match them from vector ops anyway.  I figured maintainers would prefer
this, but my maintainer ouija board is still out for repairs :)

There are no new tests, as correctness tests were added to the mid-end
and codegen tests for this already exist.

gcc/ChangeLog:

	PR target/108583
	* config/aarch64/aarch64-simd.md (@aarch64_bitmask_udiv<mode>3): Remove.
	(*bitmask_shift_plus<mode>): New.
	* config/aarch64/aarch64-sve2.md (*bitmask_shift_plus<mode>): New.
	(@aarch64_bitmask_udiv<mode>3): Remove.
	* config/aarch64/aarch64.cc
	(aarch64_vectorize_can_special_div_by_constant,
	TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST): Removed.
	(TARGET_VECTORIZE_PREFERRED_DIV_AS_SHIFTS_OVER_MULT,
	aarch64_vectorize_preferred_div_as_shifts_over_mult): New.
nstester pushed a commit to nstester/gcc that referenced this pull request Apr 26, 2023
This patch adds support for xstormy16's swpb (swap bytes) and swpw (swap
words) instructions.  The most obvious application of these is to implement
the __builtin_bswap16 and __builtin_bswap32 intrinsics.

Currently, __builtin_bswap16 is implemented as:
foo:    mov r7,r2
        shl     r7,#8
        shr     r2,#8
        or r2,r7
        ret

but with this patch becomes:
foo:	swpb r2
	ret

Likewise, __builtin_bswap32 now becomes:
foo:	swpb r2 | swpb r3 | swpw r2,r3
        ret
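
For reference, a portable C sketch of what the two expansions compute
(my illustration, not the GCC expansion itself); the bswap32 body
mirrors the swpb/swpb/swpw sequence on the r2:r3 register pair:

#include <stdint.h>

static inline uint16_t bswap16 (uint16_t x)
{
  return (uint16_t) ((x << 8) | (x >> 8));        /* swpb */
}

static inline uint32_t bswap32 (uint32_t x)
{
  /* Swap the bytes within each 16-bit half, then exchange the halves.  */
  uint32_t lo = bswap16 ((uint16_t) x);           /* swpb r2 */
  uint32_t hi = bswap16 ((uint16_t) (x >> 16));   /* swpb r3 */
  return (lo << 16) | hi;                         /* swpw r2,r3 */
}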

Finally, the swpw instruction on its own can be used to exchange
two word mode registers without a temporary, so a new pattern and
peephole2 have been added to catch this.  As described in the
PR rtl-optimization/106518, register allocation can (in theory)
be more efficient on targets that provide a swap/exchange instruction.
The slightly unusual swap<mode> naming matches that used in i386.md.

2024-04-26  Roger Sayle  <roger@nextmovesoftware.com>

gcc/ChangeLog
	* config/stormy16/stormy16.md (bswaphi2): New define_insn.
	(bswapsi2): New define_insn.
	(swaphi): New define_insn to exchange two registers (swpw).
	(define_peephole2): Recognize exchange of registers as swaphi.

gcc/testsuite/ChangeLog
	* gcc.target/xstormy16/bswap16.c: New test case.
	* gcc.target/xstormy16/bswap32.c: Likewise.
	* gcc.target/xstormy16/swpb.c: Likewise.
	* gcc.target/xstormy16/swpw-1.c: Likewise.
	* gcc.target/xstormy16/swpw-2.c: Likewise.
nstester pushed a commit to nstester/gcc that referenced this pull request Aug 4, 2023
This patch is the final piece in the series to improve the ABI issues
affecting PR 88873.  The previous patches tackled inserting DFmode
values into V2DFmode registers, by introducing insvti_{low,high}part
patterns.  This patch improves the extraction of DFmode values from
V2DFmode registers via TImode intermediates.

I'd initially thought this would require new extvti_{low,high}part
patterns to be defined, but all that's required is to recognize that
the SUBREG idioms produced by combine are equivalent to (forms of)
vec_select patterns.  The target-independent middle-end can't be sure
that the appropriate vec_select instruction exists on the target,
hence doesn't canonicalize a SUBREG of a vector mode as a vec_select,
but the backend can provide a define_split stating where and when
this is useful, for example, considering whether the operand is in
memory, or whether !TARGET_SSE_MATH and the destination is i387.

For pr88873.c, gcc -O2 -march=cascadelake currently generates:

foo:    vpunpcklqdq     %xmm3, %xmm2, %xmm7
        vpunpcklqdq     %xmm1, %xmm0, %xmm6
        vpunpcklqdq     %xmm5, %xmm4, %xmm2
        vmovdqa %xmm7, -24(%rsp)
        vmovdqa %xmm6, %xmm1
        movq    -16(%rsp), %rax
        vpinsrq $1, %rax, %xmm7, %xmm4
        vmovapd %xmm4, %xmm6
        vfmadd132pd     %xmm1, %xmm2, %xmm6
        vmovapd %xmm6, -24(%rsp)
        vmovsd  -16(%rsp), %xmm1
        vmovsd  -24(%rsp), %xmm0
        ret

with this patch, we now generate:

foo:	vpunpcklqdq     %xmm1, %xmm0, %xmm6
        vpunpcklqdq     %xmm3, %xmm2, %xmm7
        vpunpcklqdq     %xmm5, %xmm4, %xmm2
        vmovdqa %xmm6, %xmm1
        vfmadd132pd     %xmm7, %xmm2, %xmm1
        vmovsd  %xmm1, %xmm1, %xmm0
        vunpckhpd       %xmm1, %xmm1, %xmm1
        ret

The improvement is even more dramatic when compared to the original
29 instructions shown in comment #8.  GCC 13, for example, required
12 transfers to/from memory.

2023-08-04  Roger Sayle  <roger@nextmovesoftware.com>

gcc/ChangeLog
	* config/i386/sse.md (define_split): Convert highpart:DF extract
	from V2DFmode register into a sse2_storehpd instruction.
	(define_split): Likewise, convert lowpart:DF extract from V2DF
	register into a sse2_storelpd instruction.

gcc/testsuite/ChangeLog
	* gcc.target/i386/pr88873.c: Tweak to check for improved code.
nstester pushed a commit to nstester/gcc that referenced this pull request Sep 4, 2023
As discussed in PR104167 (comments #8 and below), and PR111238, using
-Wl,-gc-sections in the libstdc++ testsuite for arm-eabi
(cross-toolchain) avoids link failures for a few tests:

27_io/filesystem/path/108636.cc
std/time/clock/gps/1.cc
std/time/clock/gps/io.cc
std/time/clock/tai/1.cc
std/time/clock/tai/io.cc
std/time/clock/utc/1.cc
std/time/clock/utc/io.cc
std/time/clock/utc/leap_second_info.cc
std/time/exceptions.cc
std/time/format.cc
std/time/time_zone/get_info_local.cc
std/time/time_zone/get_info_sys.cc
std/time/tzdb/1.cc
std/time/tzdb/leap_seconds.cc
std/time/tzdb_list/1.cc
std/time/zoned_time/1.cc
std/time/zoned_time/custom.cc
std/time/zoned_time/io.cc
std/time/zoned_traits.cc

This patch achieves this by calling GLIBCXX_CHECK_LINKER_FEATURES in
cross-build cases, like we already do for native builds. We keep not
doing so in Canadian-cross builds.

However, this would hide the fact that libstdc++ somehow forces the
user to use -Wl,-gc-sections to avoid undefined references to chdir,
mkdir, chmod, pathconf, ... so maybe it's better to keep the status
quo and not apply this patch?

2023-08-31  Christophe Lyon  <christophe.lyon@linaro.org>

libstdc++-v3/ChangeLog:

	PR libstdc++/111238
	* configure: Regenerate.
	* configure.ac: Call GLIBCXX_CHECK_LINKER_FEATURES in cross,
	non-Canadian builds.
XYenChi pushed a commit to XYenChi/gcc that referenced this pull request Mar 6, 2024
hubot pushed a commit that referenced this pull request May 12, 2024
Examining the code generated for the following C snippet on a
raspberry pi:

int popcount_lut8(unsigned *buf, int n)
{
  int cnt=0;
  unsigned int i;
  do {
    i = *buf;
    cnt += lut[i&255];
    cnt += lut[i>>8&255];
    cnt += lut[i>>16&255];
    cnt += lut[i>>24];
    buf++;
  } while(--n);
  return cnt;
}

I was surprised to see the following instruction sequence generated by the
compiler:

  mov    r5, r2, lsr #8
  uxtb   r5, r5

This sequence can be performed by a single ARM instruction:

  uxtb   r5, r2, ror #8

The attached patch allows GCC's combine pass to take advantage of ARM's
uxtb with rotate functionality to implement the above zero_extract, and
likewise to use the sxtb with rotate to implement sign_extract.  ARM's
uxtb and sxtb can only be used with rotates of 0, 8, 16 and 24, and of
these only the 8 and 16 are useful [ror #0 is a nop, and extends with
ror #24 can be implemented using regular shifts], so the approach here
is to add the six missing but useful instructions as six separate
define_insns in arm.md, rather than try to be clever with new predicates.

Later ARM hardware has advanced bit field instructions, and earlier
ARM cores didn't support extend-with-rotate, so this appears to only
benefit armv6 era CPUs (e.g. the raspberry pi).
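
In C terms, four of the six new patterns correspond to extractions like
these (my sketch; the instruction comments assume the armv6 forms
described above):

#include <stdint.h>

uint32_t zext_8_8  (uint32_t x) { return (x >> 8)  & 0xff;   }  /* uxtb rd, rn, ror #8  */
int32_t  sext_8_8  (uint32_t x) { return (int8_t) (x >> 8);  }  /* sxtb rd, rn, ror #8  */
uint32_t zext_8_16 (uint32_t x) { return (x >> 16) & 0xff;   }  /* uxtb rd, rn, ror #16 */
uint32_t zext_16_8 (uint32_t x) { return (x >> 8)  & 0xffff; }  /* uxth rd, rn, ror #8  */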

Patch posted:
https://gcc.gnu.org/legacy-ml/gcc-patches/2018-01/msg01339.html
Approved by Kyrill Tkachov:
https://gcc.gnu.org/legacy-ml/gcc-patches/2018-01/msg01881.html

2024-05-12  Roger Sayle  <roger@nextmovesoftware.com>
	    Kyrill Tkachov  <kyrylo.tkachov@foss.arm.com>

	* config/arm/arm.md (*arm_zeroextractsi2_8_8, *arm_signextractsi2_8_8,
	*arm_zeroextractsi2_8_16, *arm_signextractsi2_8_16,
	*arm_zeroextractsi2_16_8, *arm_signextractsi2_16_8): New.

2024-05-12  Roger Sayle  <roger@nextmovesoftware.com>
	    Kyrill Tkachov  <kyrylo.tkachov@foss.arm.com>

	* gcc.target/arm/extend-ror.c: New test.
fjtrujy added a commit to fjtrujy/gcc that referenced this pull request May 17, 2024
trcrsired pushed a commit to trcrsired/gcc that referenced this pull request May 20, 2024
* * fix undefined references in libgfortran

* * fix internal compiler error in gfortran

* Revert "* fix internal compiler error in gfortran"

This reverts commit 4c81782ca9e75120eb91e1b6b89559f277233621.

* * revert changes for undefined references in gfortran

* Fix build on x86_64-pc-linux-gnu when host is aarch64-w64-mingw32 (#6)

* Fix x86_64-pc-linux-gnu and aarch64-pc-linux-gnu build

---------

Co-authored-by: Evgeny Karpov <eukarpov@gmail.com>
trcrsired pushed a commit to trcrsired/gcc that referenced this pull request May 26, 2024
NinaRanns referenced this pull request in NinaRanns/gcc Jul 3, 2024
hubot pushed a commit that referenced this pull request Sep 7, 2024
…o_debug_section [PR116614]

cat abc.C
  #define A(n) struct T##n {} t##n;
  #define B(n) A(n##0) A(n##1) A(n##2) A(n##3) A(n##4) A(n##5) A(n##6) A(n##7) A(n##8) A(n##9)
  #define C(n) B(n##0) B(n##1) B(n##2) B(n##3) B(n##4) B(n##5) B(n##6) B(n##7) B(n##8) B(n##9)
  #define D(n) C(n##0) C(n##1) C(n##2) C(n##3) C(n##4) C(n##5) C(n##6) C(n##7) C(n##8) C(n##9)
  #define E(n) D(n##0) D(n##1) D(n##2) D(n##3) D(n##4) D(n##5) D(n##6) D(n##7) D(n##8) D(n##9)
  E(1) E(2) E(3)
  int main () { return 0; }
./xg++ -B ./ -o abc{.o,.C} -flto -flto-partition=1to1 -O2 -g -fdebug-types-section -c
./xgcc -B ./ -o abc{,.o} -flto -flto-partition=1to1 -O2
(not included in testsuite as it takes a while to compile) FAILs with
lto-wrapper: fatal error: Too many copied sections: Operation not supported
compilation terminated.
/usr/bin/ld: error: lto-wrapper failed
collect2: error: ld returned 1 exit status

The following patch fixes that.  Most of the 64K+ section support for
reading and writing was already there years ago (and especially the
reading side was already used quite often), and a further bug in it was
fixed by the PR104617 fix.

Yet, the fix isn't solely about removing the
  if (new_i - 1 >= SHN_LORESERVE)
    {
      *err = ENOTSUP;
      return "Too many copied sections";
    }
5 lines, the missing part was that the function only handled reading of
the .symtab_shndx section but not copying/updating of it.
If the result has less than 64K-epsilon sections, that actually wasn't
needed, but e.g. with -fdebug-types-section one can exceed that pretty
easily (reported to us on WebKitGtk build on ppc64le).
Updating the section is slightly more complicated, because it basically
needs to be done in lock step with updating the .symtab section: if one
doesn't need to use SHN_XINDEX for an entry, the section should (or should
be updated to) contain an SHN_UNDEF entry, otherwise it needs to hold
whatever would otherwise be stored there but couldn't fit.  But repeating
all the symtab decisions about what to discard and how to rewrite it just
for that would be ugly.

So, the patch instead emits the .symtab_shndx section (or sections) last,
prepares their content during the .symtab processing, and in a second pass
going just through the .symtab_shndx sections uses the saved content.
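
For illustration, a hedged sketch of that lock-step rule with standard
elf.h types (the helper and its layout are assumptions, not the patch's
code):

#include <elf.h>

/* Emit one symbol's st_shndx together with its .symtab_shndx slot.
   Small (or special) indexes keep st_shndx and store SHN_UNDEF out of
   line; large real indexes store SHN_XINDEX and put the index out of
   line.  */
static void
emit_sym_shndx (Elf64_Sym *sym, Elf32_Word *shndx_entry, unsigned int index)
{
  if (index >= SHN_LORESERVE && index != SHN_ABS && index != SHN_COMMON)
    {
      sym->st_shndx = SHN_XINDEX;
      *shndx_entry = index;
    }
  else
    {
      sym->st_shndx = index;
      *shndx_entry = SHN_UNDEF;   /* keeps the two tables in lock step */
    }
}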

2024-09-07  Jakub Jelinek  <jakub@redhat.com>

	PR lto/116614
	* simple-object-elf.c (SHN_COMMON): Align comment with neighbouring
	comments.
	(SHN_HIRESERVE): Use uppercase hex digits instead of lowercase for
	consistency.
	(simple_object_elf_find_sections): Formatting fixes.
	(simple_object_elf_fetch_attributes): Likewise.
	(simple_object_elf_attributes_merge): Likewise.
	(simple_object_elf_start_write): Likewise.
	(simple_object_elf_write_ehdr): Likewise.
	(simple_object_elf_write_shdr): Likewise.
	(simple_object_elf_write_to_file): Likewise.
	(simple_object_elf_copy_lto_debug_section): Likewise.  Don't fail for
	new_i - 1 >= SHN_LORESERVE, instead arrange in that case to copy
	over .symtab_shndx sections, though emit those last and compute their
	section content when processing associated .symtab sections.  Handle
	simple_object_internal_read failure even in the .symtab_shndx reading
	case.
hubot pushed a commit that referenced this pull request Sep 12, 2024
hubot pushed a commit that referenced this pull request Sep 13, 2024
hubot pushed a commit that referenced this pull request Oct 18, 2024
Implement vddup and vidup using the new MVE builtins framework.

We generate better code because we take advantage of the two outputs
produced by the v[id]dup instructions.

For instance, before:
	ldr	r3, [r0]
	sub	r2, r3, #8
	str	r2, [r0]
	mov	r2, r3
	vddup.u16	q3, r2, #1

now:
	ldr	r2, [r0]
	vddup.u16	q3, r2, #1
	str	r2, [r0]
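
As a usage sketch (my example, not from the patch), the writeback form
is where the second output matters; compiled for an MVE target this is
the kind of code that now gets the shorter sequence:

#include <arm_mve.h>

/* vddup.u16 fills the result with {*base, *base-1, ..., *base-7} and
   writes the decremented start value (*base - 8) back through base.  */
uint16x8_t next_block (uint32_t *base)
{
  return vddupq_wb_u16 (base, 1);
}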

2024-08-21  Christophe Lyon  <christophe.lyon@linaro.org>

	gcc/
	* config/arm/arm-mve-builtins-base.cc (class viddup_impl): New.
	(vddup): New.
	(vidup): New.
	* config/arm/arm-mve-builtins-base.def (vddupq): New.
	(vidupq): New.
	* config/arm/arm-mve-builtins-base.h (vddupq): New.
	(vidupq): New.
	* config/arm/arm_mve.h (vddupq_m): Delete.
	(vddupq_u8): Delete.
	(vddupq_u32): Delete.
	(vddupq_u16): Delete.
	(vidupq_m): Delete.
	(vidupq_u8): Delete.
	(vidupq_u32): Delete.
	(vidupq_u16): Delete.
	(vddupq_x_u8): Delete.
	(vddupq_x_u16): Delete.
	(vddupq_x_u32): Delete.
	(vidupq_x_u8): Delete.
	(vidupq_x_u16): Delete.
	(vidupq_x_u32): Delete.
	(vddupq_m_n_u8): Delete.
	(vddupq_m_n_u32): Delete.
	(vddupq_m_n_u16): Delete.
	(vddupq_m_wb_u8): Delete.
	(vddupq_m_wb_u16): Delete.
	(vddupq_m_wb_u32): Delete.
	(vddupq_n_u8): Delete.
	(vddupq_n_u32): Delete.
	(vddupq_n_u16): Delete.
	(vddupq_wb_u8): Delete.
	(vddupq_wb_u16): Delete.
	(vddupq_wb_u32): Delete.
	(vidupq_m_n_u8): Delete.
	(vidupq_m_n_u32): Delete.
	(vidupq_m_n_u16): Delete.
	(vidupq_m_wb_u8): Delete.
	(vidupq_m_wb_u16): Delete.
	(vidupq_m_wb_u32): Delete.
	(vidupq_n_u8): Delete.
	(vidupq_n_u32): Delete.
	(vidupq_n_u16): Delete.
	(vidupq_wb_u8): Delete.
	(vidupq_wb_u16): Delete.
	(vidupq_wb_u32): Delete.
	(vddupq_x_n_u8): Delete.
	(vddupq_x_n_u16): Delete.
	(vddupq_x_n_u32): Delete.
	(vddupq_x_wb_u8): Delete.
	(vddupq_x_wb_u16): Delete.
	(vddupq_x_wb_u32): Delete.
	(vidupq_x_n_u8): Delete.
	(vidupq_x_n_u16): Delete.
	(vidupq_x_n_u32): Delete.
	(vidupq_x_wb_u8): Delete.
	(vidupq_x_wb_u16): Delete.
	(vidupq_x_wb_u32): Delete.
	(__arm_vddupq_m_n_u8): Delete.
	(__arm_vddupq_m_n_u32): Delete.
	(__arm_vddupq_m_n_u16): Delete.
	(__arm_vddupq_m_wb_u8): Delete.
	(__arm_vddupq_m_wb_u16): Delete.
	(__arm_vddupq_m_wb_u32): Delete.
	(__arm_vddupq_n_u8): Delete.
	(__arm_vddupq_n_u32): Delete.
	(__arm_vddupq_n_u16): Delete.
	(__arm_vidupq_m_n_u8): Delete.
	(__arm_vidupq_m_n_u32): Delete.
	(__arm_vidupq_m_n_u16): Delete.
	(__arm_vidupq_n_u8): Delete.
	(__arm_vidupq_m_wb_u8): Delete.
	(__arm_vidupq_m_wb_u16): Delete.
	(__arm_vidupq_m_wb_u32): Delete.
	(__arm_vidupq_n_u32): Delete.
	(__arm_vidupq_n_u16): Delete.
	(__arm_vidupq_wb_u8): Delete.
	(__arm_vidupq_wb_u16): Delete.
	(__arm_vidupq_wb_u32): Delete.
	(__arm_vddupq_wb_u8): Delete.
	(__arm_vddupq_wb_u16): Delete.
	(__arm_vddupq_wb_u32): Delete.
	(__arm_vddupq_x_n_u8): Delete.
	(__arm_vddupq_x_n_u16): Delete.
	(__arm_vddupq_x_n_u32): Delete.
	(__arm_vddupq_x_wb_u8): Delete.
	(__arm_vddupq_x_wb_u16): Delete.
	(__arm_vddupq_x_wb_u32): Delete.
	(__arm_vidupq_x_n_u8): Delete.
	(__arm_vidupq_x_n_u16): Delete.
	(__arm_vidupq_x_n_u32): Delete.
	(__arm_vidupq_x_wb_u8): Delete.
	(__arm_vidupq_x_wb_u16): Delete.
	(__arm_vidupq_x_wb_u32): Delete.
	(__arm_vddupq_m): Delete.
	(__arm_vddupq_u8): Delete.
	(__arm_vddupq_u32): Delete.
	(__arm_vddupq_u16): Delete.
	(__arm_vidupq_m): Delete.
	(__arm_vidupq_u8): Delete.
	(__arm_vidupq_u32): Delete.
	(__arm_vidupq_u16): Delete.
	(__arm_vddupq_x_u8): Delete.
	(__arm_vddupq_x_u16): Delete.
	(__arm_vddupq_x_u32): Delete.
	(__arm_vidupq_x_u8): Delete.
	(__arm_vidupq_x_u16): Delete.
	(__arm_vidupq_x_u32): Delete.