Basic information #8 (Open)

5hello wants to merge 2 commits into master

Conversation

@5hello commented Oct 12, 2016

No description provided.

kraj pushed a commit to kraj/gcc that referenced this pull request Aug 22, 2019
Like the logical operations, expand all shifts early rather than only
sometimes.  The Neon shift expansions are never emitted (not even with
-fneon-for-64bits), so they are not useful.  So all the late expansions
and Neon shift patterns can be removed, and shifts are more optimized
as a result.  Since some extend patterns use Neon DImode shifts, remove
the Neon extend variants and related splits.

A simple example now generates the same efficient code after this
patch with -mfpu=neon and -mfpu=vfp (previously just the fact of
having Neon enabled resulted in inefficient code for no reason).

unsigned long long f(unsigned long long x, unsigned long long y)
{ return x & (y >> 33); }

Before:
	strd    r4, r5, [sp, #-8]!
	lsr     r4, r3, #1
	mov     r5, #0
	and     r1, r1, r5
	and     r0, r0, r4
	ldrd    r4, r5, [sp]
	add     sp, sp, #8
	bx      lr

After:
	and     r0, r0, r3, lsr #1
	mov     r1, #0
	bx      lr

Bootstrap and regress OK on arm-none-linux-gnueabihf --with-cpu=cortex-a57

    gcc/
	* config/arm/iterators.md (qhs_extenddi_cstr): Update.
	(qhs_extenddi_cstr): Likewise.
	* config/arm/arm.md (ashldi3): Always expand early.
	(ashlsi3): Likewise.
	(ashrsi3): Likewise.
	(zero_extend<mode>di2): Remove Neon variants.
	(extend<mode>di2): Likewise.
	* config/arm/neon.md (ashldi3_neon_noclobber): Remove.
	(signed_shift_di3_neon): Likewise.
	(unsigned_shift_di3_neon): Likewise.
	(ashrdi3_neon_imm_noclobber): Likewise.
	(lshrdi3_neon_imm_noclobber): Likewise.
	(<shift>di3_neon): Likewise.
	(split extend): Remove DI extend split patterns.

   gcc/testsuite/
	* gcc.target/arm/neon-extend-1.c: Remove test.
	* gcc.target/arm/neon-extend-2.c: Remove test.


git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@274824 138bc75d-0d04-0410-961f-82ee72b054a4
kraj pushed a commit to kraj/gcc that referenced this pull request Sep 10, 2019
…ested function

In FDPIC mode, the trampoline generated to support pointers to nested
functions looks like:

	   .word	trampoline address
	   .word	trampoline GOT address
	   ldr 		r12, [pc, #8]
	   ldr 		r9, [pc, #8]
	   ldr		pc, [pc, #8]
	   .word	static chain value
	   .word	GOT address
	   .word	function's address

Because in FDPIC mode function pointers are actually pointers to function
descriptors, we have to generate a function descriptor for the
trampoline.

2019-09-10  Christophe Lyon  <christophe.lyon@st.com>
	Mickaël Guêné <mickael.guene@st.com>

	gcc/
	* config/arm/arm.c (arm_asm_trampoline_template): Add FDPIC
	support.
	(arm_trampoline_init): Likewise.
	(arm_trampoline_adjust_address): Likewise.
	* config/arm/arm.h (TRAMPOLINE_SIZE): Likewise.



git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@275571 138bc75d-0d04-0410-961f-82ee72b054a4
kraj pushed a commit to kraj/gcc that referenced this pull request Dec 17, 2019
This patch extends support for -mpure-code to all thumb-1 processors,
by removing the need for MOVT.

Symbol addresses are built using upper8_15, upper0_7, lower8_15 and
lower0_7 relocations, and constants are built using sequences of
movs/adds and lsls instructions.

The extension of the *thumb1_movhf pattern always uses the same size
(6), although it can emit a shorter sequence when possible. This is
similar to what *arm32_movhf already does.

CASE_VECTOR_PC_RELATIVE is now false with -mpure-code, to avoid
generating invalid assembly code with differences between symbols from
two different sections (the difference cannot be computed by the
assembler).

Tests pr45701-[12].c needed a small adjustment to avoid matching
upper8_15 when looking for the r8 register.

Test no-literal-pool.c is augmented with __fp16, so it now uses
-mfp16-format=ieee.

Test thumb1-Os-mult.c generates an inline code sequence with
-mpure-code and computes the multiplication using a sequence of
add/shift rather than the multiply instruction, so we skip it in
the presence of -mpure-code.

With -mcpu=cortex-m0, the pure-code/no-literal-pool.c fails because
code like:
static char *p = "Hello World";
char *
testchar ()
{
  return p + 4;
}

generates 2 indirections (I removed non-essential directives/code)
          .section        .rodata
	  .LC0:
	  .ascii  "Hello World\000"
	  .data
	  p:
	  .word   .LC0
	  .section        .rodata
	  .LC2:
	  .word   p
	  .section .text,"0x20000006",%progbits
	  testchar:
	  push    {r7, lr}
	  add     r7, sp, #0
	  movs    r3, #:upper8_15:#.LC2
	  lsls    r3, #8
	  adds    r3, #:upper0_7:#.LC2
	  lsls    r3, #8
	  adds    r3, #:lower8_15:#.LC2
	  lsls    r3, #8
	  adds    r3, #:lower0_7:#.LC2
	  ldr     r3, [r3]
	  ldr     r3, [r3]
	  adds    r3, r3, #4
	  movs    r0, r3
	  mov     sp, r7
	  @ sp needed
	  pop     {r7, pc}

By contrast, when using -mcpu=cortex-m4, the code looks like:
        .section        .rodata
	.LC0:
	.ascii  "Hello World\000"
	.data
	p:
	.word   .LC0
	testchar:
	push    {r7}
	add     r7, sp, #0
	movw    r3, #:lower16:p
	movt    r3, #:upper16:p
	ldr     r3, [r3]
	adds    r3, r3, #4
	mov     r0, r3
	mov     sp, r7
	pop     {r7}
	bx      lr

I haven't yet found how to make the code for cortex-m0 apply upper/lower
relocations to "p" instead of .LC2. The current code looks functional,
but could be improved.

2019-10-18  Christophe Lyon  <christophe.lyon@linaro.org>

	gcc/
	* config/arm/arm-protos.h (thumb1_gen_const_int): Add new prototype.
	* config/arm/arm.c (arm_option_check_internal): Remove restriction
	on MOVT for -mpure-code.
	(thumb1_gen_const_int): New function.
	(thumb1_legitimate_address_p): Support -mpure-code.
	(thumb1_rtx_costs): Likewise.
	(thumb1_size_rtx_costs): Likewise.
	(arm_thumb1_mi_thunk): Likewise.
	* config/arm/arm.h (CASE_VECTOR_PC_RELATIVE): Likewise.
	* config/arm/thumb1.md (thumb1_movsi_symbol_ref): New.
	(*thumb1_movhf): Support -mpure-code.

	gcc/testsuite/
	* gcc.target/arm/pr45701-1.c: Adjust for -mpure-code.
	* gcc.target/arm/pr45701-2.c: Likewise.
	* gcc.target/arm/pure-code/no-literal-pool.c: Add tests for
	__fp16.
	* gcc.target/arm/pure-code/pure-code.exp: Remove thumb2 and movt
	conditions.
	* gcc.target/arm/thumb1-Os-mult.c: Skip if -mpure-code is used.




git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@279463 138bc75d-0d04-0410-961f-82ee72b054a4
jwakely pushed a commit to jwakely/gcc that referenced this pull request Jan 14, 2020
This patch extends support for -mpure-code to all thumb-1 processors,
by removing the need for MOVT.

Symbol addresses are built using upper8_15, upper0_7, lower8_15 and
lower0_7 relocations, and constants are built using sequences of
movs/adds and lsls instructions.

The extension of the *thumb1_movhf pattern always uses the same size
(6), although it can emit a shorter sequence when possible. This is
similar to what *arm32_movhf already does.

CASE_VECTOR_PC_RELATIVE is now false with -mpure-code, to avoid
generating invalid assembly code with differences between symbols from
two different sections (the difference cannot be computed by the
assembler).

Tests pr45701-[12].c needed a small adjustment to avoid matching
upper8_15 when looking for the r8 register.

Test no-literal-pool.c is augmented with __fp16, so it now uses
-mfp16-format=ieee.

Test thumb1-Os-mult.c generates an inline code sequence with
-mpure-code and computes the multiplication using a sequence of
add/shift rather than the multiply instruction, so we skip it in
the presence of -mpure-code.

With -mcpu=cortex-m0, the pure-code/no-literal-pool.c fails because
code like:
static char *p = "Hello World";
char *
testchar ()
{
  return p + 4;
}

generates 2 indirections (I removed non-essential directives/code)
          .section        .rodata
	  .LC0:
	  .ascii  "Hello World\000"
	  .data
	  p:
	  .word   .LC0
	  .section        .rodata
	  .LC2:
	  .word   p
	  .section .text,"0x20000006",%progbits
	  testchar:
	  push    {r7, lr}
	  add     r7, sp, #0
	  movs    r3, #:upper8_15:#.LC2
	  lsls    r3, #8
	  adds    r3, #:upper0_7:#.LC2
	  lsls    r3, #8
	  adds    r3, #:lower8_15:#.LC2
	  lsls    r3, #8
	  adds    r3, #:lower0_7:#.LC2
	  ldr     r3, [r3]
	  ldr     r3, [r3]
	  adds    r3, r3, #4
	  movs    r0, r3
	  mov     sp, r7
	  @ sp needed
	  pop     {r7, pc}

By contrast, when using -mcpu=cortex-m4, the code looks like:
        .section        .rodata
	.LC0:
	.ascii  "Hello World\000"
	.data
	p:
	.word   .LC0
	testchar:
	push    {r7}
	add     r7, sp, #0
	movw    r3, #:lower16:p
	movt    r3, #:upper16:p
	ldr     r3, [r3]
	adds    r3, r3, #4
	mov     r0, r3
	mov     sp, r7
	pop     {r7}
	bx      lr

I haven't yet found how to make the code for cortex-m0 apply upper/lower
relocations to "p" instead of .LC2. The current code looks functional,
but could be improved.

2019-10-18  Christophe Lyon  <christophe.lyon@linaro.org>

	gcc/
	* config/arm/arm-protos.h (thumb1_gen_const_int): Add new prototype.
	* config/arm/arm.c (arm_option_check_internal): Remove restriction
	on MOVT for -mpure-code.
	(thumb1_gen_const_int): New function.
	(thumb1_legitimate_address_p): Support -mpure-code.
	(thumb1_rtx_costs): Likewise.
	(thumb1_size_rtx_costs): Likewise.
	(arm_thumb1_mi_thunk): Likewise.
	* config/arm/arm.h (CASE_VECTOR_PC_RELATIVE): Likewise.
	* config/arm/thumb1.md (thumb1_movsi_symbol_ref): New.
	(*thumb1_movhf): Support -mpure-code.

	gcc/testsuite/
	* gcc.target/arm/pr45701-1.c: Adjust for -mpure-code.
	* gcc.target/arm/pr45701-2.c: Likewise.
	* gcc.target/arm/pure-code/no-literal-pool.c: Add tests for
	__fp16.
	* gcc.target/arm/pure-code/pure-code.exp: Remove thumb2 and movt
	conditions.
	* gcc.target/arm/thumb1-Os-mult.c: Skip if -mpure-code is used.

From-SVN: r279463
kraj pushed a commit to kraj/gcc that referenced this pull request Feb 25, 2020
This patch extends support for -mpure-code to all thumb-1 processors,
by removing the need for MOVT.

Symbol addresses are built using upper8_15, upper0_7, lower8_15 and
lower0_7 relocations, and constants are built using sequences of
movs/adds and lsls instructions.

The extension of the *thumb1_movhf pattern always uses the same size
(6), although it can emit a shorter sequence when possible. This is
similar to what *arm32_movhf already does.

CASE_VECTOR_PC_RELATIVE is now false with -mpure-code, to avoid
generating invalid assembly code with differences between symbols from
two different sections (the difference cannot be computed by the
assembler).

Tests pr45701-[12].c needed a small adjustment to avoid matching
upper8_15 when looking for the r8 register.

Test no-literal-pool.c is augmented with __fp16, so it now uses
-mfp16-format=ieee.

Test thumb1-Os-mult.c generates an inline code sequence with
-mpure-code and computes the multiplication using a sequence of
add/shift rather than the multiply instruction, so we skip it in
the presence of -mpure-code.

With -mcpu=cortex-m0, the pure-code/no-literal-pool.c fails because
code like:
static char *p = "Hello World";
char *
testchar ()
{
  return p + 4;
}

generates 2 indirections (I removed non-essential directives/code)
          .section        .rodata
	  .LC0:
	  .ascii  "Hello World\000"
	  .data
	  p:
	  .word   .LC0
	  .section        .rodata
	  .LC2:
	  .word   p
	  .section .text,"0x20000006",%progbits
	  testchar:
	  push    {r7, lr}
	  add     r7, sp, #0
	  movs    r3, #:upper8_15:#.LC2
	  lsls    r3, #8
	  adds    r3, #:upper0_7:#.LC2
	  lsls    r3, #8
	  adds    r3, #:lower8_15:#.LC2
	  lsls    r3, #8
	  adds    r3, #:lower0_7:#.LC2
	  ldr     r3, [r3]
	  ldr     r3, [r3]
	  adds    r3, r3, #4
	  movs    r0, r3
	  mov     sp, r7
	  @ sp needed
	  pop     {r7, pc}

By contrast, when using -mcpu=cortex-m4, the code looks like:
        .section        .rodata
	.LC0:
	.ascii  "Hello World\000"
	.data
	p:
	.word   .LC0
	testchar:
	push    {r7}
	add     r7, sp, #0
	movw    r3, #:lower16:p
	movt    r3, #:upper16:p
	ldr     r3, [r3]
	adds    r3, r3, #4
	mov     r0, r3
	mov     sp, r7
	pop     {r7}
	bx      lr

I haven't yet found how to make the code for cortex-m0 apply upper/lower
relocations to "p" instead of .LC2. The current code looks functional,
but could be improved.

2020-02-25  Christophe Lyon  <christophe.lyon@linaro.org>

	Backport from mainline
	2019-10-18  Christophe Lyon  <christophe.lyon@linaro.org>

	gcc/
	* config/arm/arm-protos.h (thumb1_gen_const_int): Add new prototype.
	* config/arm/arm.c (arm_option_check_internal): Remove restriction
	on MOVT for -mpure-code.
	(thumb1_gen_const_int): New function.
	(thumb1_legitimate_address_p): Support -mpure-code.
	(thumb1_rtx_costs): Likewise.
	(thumb1_size_rtx_costs): Likewise.
	(arm_thumb1_mi_thunk): Likewise.
	* config/arm/arm.h (CASE_VECTOR_PC_RELATIVE): Likewise.
	* config/arm/thumb1.md (thumb1_movsi_symbol_ref): New.
	(*thumb1_movhf): Support -mpure-code.

	gcc/testsuite/
	* gcc.target/arm/pr45701-1.c: Adjust for -mpure-code.
	* gcc.target/arm/pr45701-2.c: Likewise.
	* gcc.target/arm/pure-code/no-literal-pool.c: Add tests for
	__fp16.
	* gcc.target/arm/pure-code/pure-code.exp: Remove thumb2 and movt
	conditions.
	* gcc.target/arm/thumb1-Os-mult.c: Skip if -mpure-code is used.
kraj pushed a commit to kraj/gcc that referenced this pull request Mar 11, 2020
When using `check-function-bodies`, the subroutine `parse_function_bodies` uses
the `fluff` regexp to remove uninteresting assembly lines.

Arm targets generate assembly with some lines prefixed by `@`; these lines
are left behind by this process.

As an example of some lines prefixed by `@`, the assembly output from the
`stacktest1` function in "bfloat16_simd_3_1.c" is:

        .align  2
        .global stacktest1
        .arch armv8.2-a
        .syntax unified
        .arm
        .fpu neon-fp-armv8
        .type   stacktest1, %function
stacktest1:
        @ args = 0, pretend = 0, frame = 8
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        sub     sp, sp, #8
        add     r3, sp, #6
        vst1.16 {d0[0]}, [r3]
        vld1.16 {d0[0]}, [r3]
        add     sp, sp, #8
        @ sp needed
        bx      lr
        .size   stacktest1, .-stacktest1

It seems that previous uses of `check-function-bodies` in the arm backend have
avoided problems with such lines since they use the `...` regexp in each place
such fluff occurs.

I'm currently writing a patch in which I'd like to match the entire function
body, so I'd like to remove such `@` lines automatically.

gcc/testsuite/ChangeLog:

2020-03-11  Matthew Malcomson  <matthew.malcomson@arm.com>

	* lib/scanasm.exp (parse_function_bodies): Lines starting with '@' also
	counted as fluff.
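
For illustration, here is a small self-contained C++ sketch of the filtering idea (an assumed example, not the actual Tcl code in lib/scanasm.exp): lines beginning with an assembler directive ('.') or an Arm '@' comment are treated as fluff and dropped before the function body is compared.

// Hypothetical illustration of the fluff-filtering idea described above.
#include <iostream>
#include <regex>
#include <string>
#include <vector>

int main ()
{
  std::vector<std::string> body = {
    "stacktest1:",
    "\t@ args = 0, pretend = 0, frame = 8",
    "\tsub     sp, sp, #8",
    "\t@ sp needed",
    "\tbx      lr",
    "\t.size   stacktest1, .-stacktest1",
  };
  // '.' directives and, with the change above, '@' comment lines are fluff.
  std::regex fluff{R"(^\s*(\.|@))"};
  for (const std::string &line : body)
    if (!std::regex_search (line, fluff))
      std::cout << line << '\n';   // only the label and real instructions remain
  return 0;
}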
kraj pushed a commit to kraj/gcc that referenced this pull request Aug 12, 2020
The stack-protector-1.c test fails when compiled for Cortex-M:
- for Cortex-M0/M1, str r0, [sp, #-8]! is not supported
- for Cortex-M3/M4..., the assembler complains that "use of r13 is
  deprecated"

This patch replaces the str instruction with
     sub   sp, sp, #8
     str r0, [sp]
and removes the check for r13, which is unlikely to leak the canary
value.

2020-08-11  Christophe Lyon  <christophe.lyon@linaro.org>

	gcc/testsuite/
	* gcc.target/arm/stack-protector-1.c: Adapt code to Cortex-M
	restrictions.
kraj pushed a commit to kraj/gcc that referenced this pull request Aug 12, 2020
The stack-protector-1.c test fails when compiled for Cortex-M:
- for Cortex-M0/M1, str r0, [sp, #-8]! is not supported
- for Cortex-M3/M4..., the assembler complains that "use of r13 is
  deprecated"

This patch replaces the str instruction with
     sub   sp, sp, #8
     str r0, [sp]
and removes the check for r13, which is unlikely to leak the canary
value.

2020-08-11  Christophe Lyon  <christophe.lyon@linaro.org>

	gcc/testsuite/
	* gcc.target/arm/stack-protector-1.c: Adapt code to Cortex-M
	restrictions.

(cherry picked from commit 6606fdc)
kraj pushed a commit to kraj/gcc that referenced this pull request Sep 4, 2020
This patch moves the move-immediate splitter after the regular ones so
that it has lower precedence, and updates its constraints.

For
int f3 (void) { return 0x11000000; }
int f3_2 (void) { return 0x12345678; }

we now generate:
* with -O2 -mcpu=cortex-m0 -mpure-code:
f3:
	movs    r0, #136
	lsls    r0, r0, #21
	bx      lr
f3_2:
	movs    r0, #18
	lsls    r0, r0, #8
	adds    r0, r0, #52
	lsls    r0, r0, #8
	adds    r0, r0, #86
	lsls    r0, r0, #8
	adds    r0, r0, #121
	bx      lr

* with -O2 -mcpu=cortex-m23 -mpure-code:
f3:
	movs    r0, #136
	lsls    r0, r0, #21
	bx      lr
f3_2:
	movw    r0, #22136
	movt    r0, 4660
	bx      lr

2020-09-04  Christophe Lyon  <christophe.lyon@linaro.org>

	PR target/96769
	gcc/
	* config/arm/thumb1.md: Move movsi splitter for
	arm_disable_literal_pool after the other movsi splitters.

	gcc/testsuite/
	* gcc.target/arm/pure-code/pr96769.c: New test.
kraj pushed a commit to kraj/gcc that referenced this pull request Oct 12, 2020
Prevents the following UBSAN error:

./xgcc -B. /home/marxin/Programming/gcc/gcc/testsuite/g++.dg/torture/pr49770.C -O2 -c
/home/marxin/Programming/gcc2/gcc/ipa-modref-tree.h:482:22: runtime error: load of value 2, which is not a valid value for type 'bool'
    #0 0x1fdb4d1 in modref_tree<int>::merge(modref_tree<int>*, vec<modref_parm_map, va_heap, vl_ptr>*) /home/marxin/Programming/gcc2/gcc/ipa-modref-tree.h:482
    #1 0x1fcadaa in merge_call_side_effects(modref_summary*, gimple*, modref_summary*, bool) /home/marxin/Programming/gcc2/gcc/ipa-modref.c:511
    #2 0x1fcbadd in analyze_call /home/marxin/Programming/gcc2/gcc/ipa-modref.c:642
    #3 0x1fcc061 in analyze_stmt /home/marxin/Programming/gcc2/gcc/ipa-modref.c:732
    #4 0x1fccf31 in analyze_function /home/marxin/Programming/gcc2/gcc/ipa-modref.c:823
    #5 0x1fd17e5 in execute /home/marxin/Programming/gcc2/gcc/ipa-modref.c:1441
    #6 0x25cca6e in execute_one_pass(opt_pass*) /home/marxin/Programming/gcc2/gcc/passes.c:2509
    #7 0x25cd39b in execute_pass_list_1 /home/marxin/Programming/gcc2/gcc/passes.c:2597
    #8 0x25cd450 in execute_pass_list_1 /home/marxin/Programming/gcc2/gcc/passes.c:2598
    #9 0x25cd4ee in execute_pass_list(function*, opt_pass*) /home/marxin/Programming/gcc2/gcc/passes.c:2608
    #10 0x25c7a5a in do_per_function_toporder(void (*)(function*, void*), void*) /home/marxin/Programming/gcc2/gcc/passes.c:1726
    #11 0x25cfa3f in execute_ipa_pass_list(opt_pass*) /home/marxin/Programming/gcc2/gcc/passes.c:2941
    #12 0x173572d in ipa_passes /home/marxin/Programming/gcc2/gcc/cgraphunit.c:2642
    #13 0x17364ee in symbol_table::compile() /home/marxin/Programming/gcc2/gcc/cgraphunit.c:2777
    #14 0x17372d9 in symbol_table::finalize_compilation_unit() /home/marxin/Programming/gcc2/gcc/cgraphunit.c:3022
    #15 0x2a1f00a in compile_file /home/marxin/Programming/gcc2/gcc/toplev.c:485
    #16 0x2a27dc8 in do_compile /home/marxin/Programming/gcc2/gcc/toplev.c:2321
    #17 0x2a283cc in toplev::main(int, char**) /home/marxin/Programming/gcc2/gcc/toplev.c:2460
    #18 0x54f21cd in main /home/marxin/Programming/gcc2/gcc/main.c:39
    #19 0x7ffff6f0de09 in __libc_start_main ../csu/libc-start.c:314
    #20 0x9eac09 in _start (/home/marxin/Programming/gcc2/objdir/gcc/cc1plus+0x9eac09)

gcc/ChangeLog:

	* ipa-modref.c (merge_call_side_effects): Clear modref_parm_map
	fields in the vector.
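
A minimal stand-alone C++ illustration of the failure mode (the struct and field names below are made up to mirror modref_parm_map; this is not the ipa-modref code itself): an element whose storage is never cleared can hold a byte such as 2 in a bool member, and loading it is the undefined behaviour UBSan reports. Clearing the element before it is read, as the fix does for the vector, avoids it.

// Hypothetical sketch of the uninitialized-bool defect and its fix.
#include <cstdio>

struct parm_map_sketch   // assumed stand-in for modref_parm_map
{
  int parm_index;
  bool parm_offset_known;
};

int main ()
{
  // Default-initialization leaves the members indeterminate, much like
  // growing a vector without clearing the new elements.
  parm_map_sketch *v = new parm_map_sketch[1];

  // Reading v[0].parm_offset_known here would be the reported UB: the byte
  // may happen to hold 2, which is not a valid value for 'bool'.

  v[0] = parm_map_sketch ();   // the fix: clear the element before it is read
  std::printf ("parm_offset_known = %d\n", v[0].parm_offset_known ? 1 : 0);

  delete[] v;
  return 0;
}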
kraj pushed a commit to kraj/gcc that referenced this pull request Oct 16, 2020
…trinsics with -O2 (PR97271).

This patch fixes (PR97271) the wrong code-gen for mve scatter store with writeback intrinsics with -O2.

$cat bug.c
void
foo (uint32x4_t * addr, const int offset, int32x4_t value)
{
  vstrwq_scatter_base_wb_s32 (addr, 8, value);
}

$ arm-none-eabi-gcc  bug.c -S -O2 -march=armv8.1-m.main+mve -mfloat-abi=hard -o -
Without this patch:
...
foo:
	vldrw.32	q3, [r0]
	vstrw.u32       q0, [q3, #8]!  ---> (A)
	vldr.64 d4, .L3
	vldr.64 d5, .L3+8
	vldrw.32	q3, [r0]
	vstrw.u32       q2, [q3, #8]!  ---> (B)
	bx      lr
...

With this patch:
...
foo:
	vldrw.32	q3, [r0]
	vstrw.u32       q0, [q3, #8]!  --> (C)
	vstrw.32	q3, [r0]
	bx      lr
...

Without this patch, two vstrw assembly instructions (A and B) are generated for the vstrwq_scatter_base_wb_s32
intrinsic, whereas the fix generates only one vstrw assembly instruction (C).

gcc/ChangeLog:

2020-10-06  Srinath Parvathaneni  <srinath.parvathaneni@arm.com>

	PR target/97291
	* config/arm/arm-builtins.c (arm_strsbwbs_qualifiers): Modify array.
	(arm_strsbwbu_qualifiers): Likewise.
	(arm_strsbwbs_p_qualifiers): Likewise.
	(arm_strsbwbu_p_qualifiers): Likewise.
	* config/arm/arm_mve.h (__arm_vstrdq_scatter_base_wb_s64): Modify
	function definition.
	(__arm_vstrdq_scatter_base_wb_u64): Likewise.
	(__arm_vstrdq_scatter_base_wb_p_s64): Likewise.
	(__arm_vstrdq_scatter_base_wb_p_u64): Likewise.
	(__arm_vstrwq_scatter_base_wb_p_s32): Likewise.
	(__arm_vstrwq_scatter_base_wb_p_u32): Likewise.
	(__arm_vstrwq_scatter_base_wb_s32): Likewise.
	(__arm_vstrwq_scatter_base_wb_u32): Likewise.
	(__arm_vstrwq_scatter_base_wb_f32): Likewise.
	(__arm_vstrwq_scatter_base_wb_p_f32): Likewise.
	* config/arm/arm_mve_builtins.def (vstrwq_scatter_base_wb_add_u): Remove
	expansion for the builtin.
	(vstrwq_scatter_base_wb_add_s): Likewise.
	(vstrwq_scatter_base_wb_add_f): Likewise.
	(vstrdq_scatter_base_wb_add_u): Likewise.
	(vstrdq_scatter_base_wb_add_s): Likewise.
	(vstrwq_scatter_base_wb_p_add_u): Likewise.
	(vstrwq_scatter_base_wb_p_add_s): Likewise.
	(vstrwq_scatter_base_wb_p_add_f): Likewise.
	(vstrdq_scatter_base_wb_p_add_u): Likewise.
	(vstrdq_scatter_base_wb_p_add_s): Likewise.
	* config/arm/mve.md (mve_vstrwq_scatter_base_wb_<supf>v4si): Remove
	expand.
	(mve_vstrwq_scatter_base_wb_add_<supf>v4si): Likewise.
	(mve_vstrwq_scatter_base_wb_<supf>v4si_insn): Rename pattern to ...
	(mve_vstrwq_scatter_base_wb_<supf>v4si): This.
	(mve_vstrwq_scatter_base_wb_p_<supf>v4si): Remove expand.
	(mve_vstrwq_scatter_base_wb_p_add_<supf>v4si): Likewise.
	(mve_vstrwq_scatter_base_wb_p_<supf>v4si_insn): Rename pattern to ...
	(mve_vstrwq_scatter_base_wb_p_<supf>v4si): This.
	(mve_vstrwq_scatter_base_wb_fv4sf): Remove expand.
	(mve_vstrwq_scatter_base_wb_add_fv4sf): Likewise.
	(mve_vstrwq_scatter_base_wb_fv4sf_insn): Rename pattern to ...
	(mve_vstrwq_scatter_base_wb_fv4sf): This.
	(mve_vstrwq_scatter_base_wb_p_fv4sf): Remove expand.
	(mve_vstrwq_scatter_base_wb_p_add_fv4sf): Likewise.
	(mve_vstrwq_scatter_base_wb_p_fv4sf_insn): Rename pattern to ...
	(mve_vstrwq_scatter_base_wb_p_fv4sf): This.
	(mve_vstrdq_scatter_base_wb_<supf>v2di): Remove expand.
	(mve_vstrdq_scatter_base_wb_add_<supf>v2di): Likewise.
	(mve_vstrdq_scatter_base_wb_<supf>v2di_insn): Rename pattern to ...
	(mve_vstrdq_scatter_base_wb_<supf>v2di): This.
	(mve_vstrdq_scatter_base_wb_p_<supf>v2di): Remove expand.
	(mve_vstrdq_scatter_base_wb_p_add_<supf>v2di): Likewise.
	(mve_vstrdq_scatter_base_wb_p_<supf>v2di_insn): Rename pattern to ...
	(mve_vstrdq_scatter_base_wb_p_<supf>v2di): This.

gcc/testsuite/ChangeLog:

	PR target/97291
	* gcc.target/arm/mve/intrinsics/vstrdq_scatter_base_wb_p_s64.c: Modify.
	* gcc.target/arm/mve/intrinsics/vstrdq_scatter_base_wb_p_u64.c:
	Likewise.
	* gcc.target/arm/mve/intrinsics/vstrdq_scatter_base_wb_s64.c: Likewise.
	* gcc.target/arm/mve/intrinsics/vstrdq_scatter_base_wb_u64.c: Likewise.
	* gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_wb_f32.c: Likewise.
	* gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_wb_p_f32.c:
	Likewise.
	* gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_wb_p_s32.c:
	Likewise.
	* gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_wb_p_u32.c:
	Likewise.
	* gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_wb_s32.c: Likewise.
	* gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_wb_u32.c: Likewise.
kraj pushed a commit to kraj/gcc that referenced this pull request Oct 16, 2020
…trinsics with -O2 (PR97271).

This patch fixes (PR97271) the wrong code-gen for mve scatter store with writeback intrinsics with -O2.

$cat bug.c
void
foo (uint32x4_t * addr, const int offset, int32x4_t value)
{
  vstrwq_scatter_base_wb_s32 (addr, 8, value);
}

$ arm-none-eabi-gcc  bug.c -S -O2 -march=armv8.1-m.main+mve -mfloat-abi=hard -o -
Without this patch:
...
foo:
	vldrw.32	q3, [r0]
	vstrw.u32       q0, [q3, #8]!  ---> (A)
	vldr.64 d4, .L3
	vldr.64 d5, .L3+8
	vldrw.32	q3, [r0]
	vstrw.u32       q2, [q3, #8]!  ---> (B)
	bx      lr
...

With this patch:
...
foo:
	vldrw.32	q3, [r0]
	vstrw.u32       q0, [q3, #8]!  --> (C)
	vstrw.32	q3, [r0]
	bx      lr
...

Without this patch, two vstrw assembly instructions (A and B) are generated for the vstrwq_scatter_base_wb_s32
intrinsic, whereas the fix generates only one vstrw assembly instruction (C).

gcc/ChangeLog:

2020-10-06  Srinath Parvathaneni  <srinath.parvathaneni@arm.com>

	PR target/97291
	* config/arm/arm-builtins.c (arm_strsbwbs_qualifiers): Modify array.
	(arm_strsbwbu_qualifiers): Likewise.
	(arm_strsbwbs_p_qualifiers): Likewise.
	(arm_strsbwbu_p_qualifiers): Likewise.
	* config/arm/arm_mve.h (__arm_vstrdq_scatter_base_wb_s64): Modify
	function definition.
	(__arm_vstrdq_scatter_base_wb_u64): Likewise.
	(__arm_vstrdq_scatter_base_wb_p_s64): Likewise.
	(__arm_vstrdq_scatter_base_wb_p_u64): Likewise.
	(__arm_vstrwq_scatter_base_wb_p_s32): Likewise.
	(__arm_vstrwq_scatter_base_wb_p_u32): Likewise.
	(__arm_vstrwq_scatter_base_wb_s32): Likewise.
	(__arm_vstrwq_scatter_base_wb_u32): Likewise.
	(__arm_vstrwq_scatter_base_wb_f32): Likewise.
	(__arm_vstrwq_scatter_base_wb_p_f32): Likewise.
	* config/arm/arm_mve_builtins.def (vstrwq_scatter_base_wb_add_u): Remove
	expansion for the builtin.
	(vstrwq_scatter_base_wb_add_s): Likewise.
	(vstrwq_scatter_base_wb_add_f): Likewise.
	(vstrdq_scatter_base_wb_add_u): Likewise.
	(vstrdq_scatter_base_wb_add_s): Likewise.
	(vstrwq_scatter_base_wb_p_add_u): Likewise.
	(vstrwq_scatter_base_wb_p_add_s): Likewise.
	(vstrwq_scatter_base_wb_p_add_f): Likewise.
	(vstrdq_scatter_base_wb_p_add_u): Likewise.
	(vstrdq_scatter_base_wb_p_add_s): Likewise.
	* config/arm/mve.md (mve_vstrwq_scatter_base_wb_<supf>v4si): Remove
	expand.
	(mve_vstrwq_scatter_base_wb_add_<supf>v4si): Likewise.
	(mve_vstrwq_scatter_base_wb_<supf>v4si_insn): Rename pattern to ...
	(mve_vstrwq_scatter_base_wb_<supf>v4si): This.
	(mve_vstrwq_scatter_base_wb_p_<supf>v4si): Remove expand.
	(mve_vstrwq_scatter_base_wb_p_add_<supf>v4si): Likewise.
	(mve_vstrwq_scatter_base_wb_p_<supf>v4si_insn): Rename pattern to ...
	(mve_vstrwq_scatter_base_wb_p_<supf>v4si): This.
	(mve_vstrwq_scatter_base_wb_fv4sf): Remove expand.
	(mve_vstrwq_scatter_base_wb_add_fv4sf): Likewise.
	(mve_vstrwq_scatter_base_wb_fv4sf_insn): Rename pattern to ...
	(mve_vstrwq_scatter_base_wb_fv4sf): This.
	(mve_vstrwq_scatter_base_wb_p_fv4sf): Remove expand.
	(mve_vstrwq_scatter_base_wb_p_add_fv4sf): Likewise.
	(mve_vstrwq_scatter_base_wb_p_fv4sf_insn): Rename pattern to ...
	(mve_vstrwq_scatter_base_wb_p_fv4sf): This.
	(mve_vstrdq_scatter_base_wb_<supf>v2di): Remove expand.
	(mve_vstrdq_scatter_base_wb_add_<supf>v2di): Likewise.
	(mve_vstrdq_scatter_base_wb_<supf>v2di_insn): Rename pattern to ...
	(mve_vstrdq_scatter_base_wb_<supf>v2di): This.
	(mve_vstrdq_scatter_base_wb_p_<supf>v2di): Remove expand.
	(mve_vstrdq_scatter_base_wb_p_add_<supf>v2di): Likewise.
	(mve_vstrdq_scatter_base_wb_p_<supf>v2di_insn): Rename pattern to ...
	(mve_vstrdq_scatter_base_wb_p_<supf>v2di): This.

gcc/testsuite/ChangeLog:

	PR target/97291
	* gcc.target/arm/mve/intrinsics/vstrdq_scatter_base_wb_p_s64.c: Modify.
	* gcc.target/arm/mve/intrinsics/vstrdq_scatter_base_wb_p_u64.c:
	Likewise.
	* gcc.target/arm/mve/intrinsics/vstrdq_scatter_base_wb_s64.c: Likewise.
	* gcc.target/arm/mve/intrinsics/vstrdq_scatter_base_wb_u64.c: Likewise.
	* gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_wb_f32.c: Likewise.
	* gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_wb_p_f32.c:
	Likewise.
	* gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_wb_p_s32.c:
	Likewise.
	* gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_wb_p_u32.c:
	Likewise.
	* gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_wb_s32.c: Likewise.
	* gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_wb_u32.c: Likewise.

(cherry picked from commit 3775358)
kraj pushed a commit to kraj/gcc that referenced this pull request Oct 19, 2020
It fixes:

/home/marxin/Programming/gcc2/gcc/ipa-modref-tree.h:482:22: runtime error: load of value 255, which is not a valid value for type 'bool'
    #0 0x18e5df3 in modref_tree<int>::merge(modref_tree<int>*, vec<modref_parm_map, va_heap, vl_ptr>*) /home/marxin/Programming/gcc2/gcc/ipa-modref-tree.h:482
    #1 0x18dc180 in ipa_merge_modref_summary_after_inlining(cgraph_edge*) /home/marxin/Programming/gcc2/gcc/ipa-modref.c:1779
    #2 0x18c1c72 in inline_call(cgraph_edge*, bool, vec<cgraph_edge*, va_heap, vl_ptr>*, int*, bool, bool*) /home/marxin/Programming/gcc2/gcc/ipa-inline-transform.c:492
    #3 0x4a3589c in inline_small_functions /home/marxin/Programming/gcc2/gcc/ipa-inline.c:2216
    #4 0x4a3b230 in ipa_inline /home/marxin/Programming/gcc2/gcc/ipa-inline.c:2697
    #5 0x4a3d902 in execute /home/marxin/Programming/gcc2/gcc/ipa-inline.c:3096
    #6 0x1edf831 in execute_one_pass(opt_pass*) /home/marxin/Programming/gcc2/gcc/passes.c:2509
    #7 0x1ee26af in execute_ipa_pass_list(opt_pass*) /home/marxin/Programming/gcc2/gcc/passes.c:2936
    #8 0x103f31b in ipa_passes /home/marxin/Programming/gcc2/gcc/cgraphunit.c:2700
    #9 0x103fb40 in symbol_table::compile() /home/marxin/Programming/gcc2/gcc/cgraphunit.c:2777
    #10 0x104092b in symbol_table::finalize_compilation_unit() /home/marxin/Programming/gcc2/gcc/cgraphunit.c:3022
    #11 0x235723b in compile_file /home/marxin/Programming/gcc2/gcc/toplev.c:485
    #12 0x235fff9 in do_compile /home/marxin/Programming/gcc2/gcc/toplev.c:2321
    #13 0x23605fc in toplev::main(int, char**) /home/marxin/Programming/gcc2/gcc/toplev.c:2460
    #14 0x4e2b93b in main /home/marxin/Programming/gcc2/gcc/main.c:39
    #15 0x7ffff6f0ae09 in __libc_start_main ../csu/libc-start.c:314
    #16 0x9a0be9 in _start (/home/marxin/Programming/gcc2/objdir/gcc/cc1+0x9a0be9)

gcc/ChangeLog:

	* ipa-modref.c (compute_parm_map): Clear vector.
kraj pushed a commit to kraj/gcc that referenced this pull request Nov 2, 2020
Enable thumb1_gen_const_int to generate RTL or asm depending on the
context, so that we avoid duplicating code to handle constants in
Thumb-1 with -mpure-code.

Use a template so that the algorithm is effectively shared, and
rely on two classes to handle the actual emission as RTL or asm.

The generated sequence is improved to handle right-shiftable and small
values with fewer instructions. We now generate:

128:
        movs    r0, r0, #128
264:
        movs    r3, #33
        lsls    r3, #3
510:
        movs    r3, #255
        lsls    r3, #1
512:
        movs    r3, #1
        lsls    r3, #9
764:
        movs    r3, #191
        lsls    r3, #2
65536:
        movs    r3, #1
        lsls    r3, #16
0x123456:
        movs    r3, #18 ;0x12
        lsls    r3, #8
        adds    r3, #52 ;0x34
        lsls    r3, #8
        adds    r3, #86 ;0x56
0x1123456:
        movs    r3, #137 ;0x89
        lsls    r3, #8
        adds    r3, #26 ;0x1a
        lsls    r3, #8
        adds    r3, #43 ;0x2b
        lsls    r3, #1
0x1000010:
        movs    r3, #16
        lsls    r3, #16
        adds    r3, #1
        lsls    r3, #4
0x1000011:
        movs    r3, #1
        lsls    r3, #24
        adds    r3, #17
-8192:
	movs	r3, #1
	lsls	r3, #13
	rsbs	r3, #0

The patch adds a testcase which does not fully exercise
thumb1_gen_const_int, as other existing patterns already catch small
constants.  These parts of thumb1_gen_const_int are used by
arm_thumb1_mi_thunk.

2020-11-02  Christophe Lyon  <christophe.lyon@linaro.org>

	gcc/
	* config/arm/arm.c (thumb1_const_rtl, thumb1_const_print): New
	classes.
	(thumb1_gen_const_int): Rename to ...
	(thumb1_gen_const_int_1): ... New helper function. Add capability
	to emit either RTL or asm, improve generated code.
	(thumb1_gen_const_int_rtl): New function.
	* config/arm/arm-protos.h (thumb1_gen_const_int): Rename to
	thumb1_gen_const_int_rtl.
	* config/arm/thumb1.md: Call thumb1_gen_const_int_rtl instead
	of thumb1_gen_const_int.

	gcc/testsuite/
	* gcc.target/arm/pure-code/no-literal-pool-m0.c: New.
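
The shared-algorithm/two-emitters idea can be sketched in plain C++ as below. This is only an illustration under assumed names (asm_emitter, rtl_emitter, gen_const_int) and a simplified byte-at-a-time splitting strategy; it is not the arm.c implementation, whose classes and constant-splitting logic differ.

// Hypothetical sketch: one constant-building algorithm, two emission back ends.
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Emits assembler text, as the -mpure-code asm output path would.
struct asm_emitter {
  void mov (unsigned imm) { std::printf ("\tmovs\tr3, #%u\n", imm); }
  void lsl (unsigned amt) { std::printf ("\tlsls\tr3, #%u\n", amt); }
  void add (unsigned imm) { std::printf ("\tadds\tr3, #%u\n", imm); }
};

// Stands in for the RTL-building path: records the operations instead.
struct rtl_emitter {
  std::vector<std::string> insns;
  void mov (unsigned imm) { insns.push_back ("set r3 " + std::to_string (imm)); }
  void lsl (unsigned amt) { insns.push_back ("ashift r3 " + std::to_string (amt)); }
  void add (unsigned imm) { insns.push_back ("plus r3 " + std::to_string (imm)); }
};

// Shared algorithm: build VAL byte by byte, high byte first
// (movs/lsls/adds), as in the sequences shown in the commit message.
template <typename Emitter>
void gen_const_int (Emitter &e, uint32_t val)
{
  int top = 24;
  while (top > 0 && ((val >> top) & 0xff) == 0)
    top -= 8;
  e.mov ((val >> top) & 0xff);
  for (int b = top - 8; b >= 0; b -= 8)
    {
      e.lsl (8);
      e.add ((val >> b) & 0xff);
    }
}

int main ()
{
  asm_emitter ae;
  gen_const_int (ae, 0x123456);   // prints the movs/lsls/adds sequence

  rtl_emitter re;
  gen_const_int (re, 0x123456);   // records the same steps as pseudo-RTL
  std::printf ("%zu recorded insns\n", re.insns.size ());
  return 0;
}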
kraj pushed a commit to kraj/gcc that referenced this pull request Jan 12, 2021
This patch adds new movmisalign<mode>_mve_load and store patterns for
MVE to help vectorization. They are very similar to their Neon
counterparts, but use different iterators and instructions.

Indeed MVE supports less vectors modes than Neon, so we use the
MVE_VLD_ST iterator where Neon uses VQX.

Since the supported modes are different from the ones valid for
arithmetic operators, we introduce two new sets of macros:

ARM_HAVE_NEON_<MODE>_LDST
  true if Neon has vector load/store instructions for <MODE>

ARM_HAVE_<MODE>_LDST
  true if any vector extension has vector load/store instructions for <MODE>

We move the movmisalign<mode> expander from neon.md to vec-common.md, and
replace the TARGET_NEON enabler with ARM_HAVE_<MODE>_LDST.

The patch also updates the mve-vneg.c test to scan for the better code
generation when loading and storing the vectors involved: it checks
that no 'orr' instruction is generated to cope with misalignment at
runtime.
This test was chosen among the other mve tests, but any other should
be OK. Using a plain vector copy loop (dest[i] = a[i]) is not a good
test because the compiler chooses to use memcpy.

For instance we now generate:
test_vneg_s32x4:
	vldrw.32       q3, [r1]
	vneg.s32  q3, q3
	vstrw.32       q3, [r0]
	bx      lr

instead of:
test_vneg_s32x4:
	orr     r3, r1, r0
	lsls    r3, r3, #28
	bne     .L15
	vldrw.32	q3, [r1]
	vneg.s32  q3, q3
	vstrw.32	q3, [r0]
	bx      lr
	.L15:
	push    {r4, r5}
	ldrd    r2, r3, [r1, #8]
	ldrd    r5, r4, [r1]
	rsbs    r2, r2, #0
	rsbs    r5, r5, #0
	rsbs    r4, r4, #0
	rsbs    r3, r3, #0
	strd    r5, r4, [r0]
	pop     {r4, r5}
	strd    r2, r3, [r0, #8]
	bx      lr

2021-01-12  Christophe Lyon  <christophe.lyon@linaro.org>

	PR target/97875
	gcc/
	* config/arm/arm.h (ARM_HAVE_NEON_V8QI_LDST): New macro.
	(ARM_HAVE_NEON_V16QI_LDST, ARM_HAVE_NEON_V4HI_LDST): Likewise.
	(ARM_HAVE_NEON_V8HI_LDST, ARM_HAVE_NEON_V2SI_LDST): Likewise.
	(ARM_HAVE_NEON_V4SI_LDST, ARM_HAVE_NEON_V4HF_LDST): Likewise.
	(ARM_HAVE_NEON_V8HF_LDST, ARM_HAVE_NEON_V4BF_LDST): Likewise.
	(ARM_HAVE_NEON_V8BF_LDST, ARM_HAVE_NEON_V2SF_LDST): Likewise.
	(ARM_HAVE_NEON_V4SF_LDST, ARM_HAVE_NEON_DI_LDST): Likewise.
	(ARM_HAVE_NEON_V2DI_LDST): Likewise.
	(ARM_HAVE_V8QI_LDST, ARM_HAVE_V16QI_LDST): Likewise.
	(ARM_HAVE_V4HI_LDST, ARM_HAVE_V8HI_LDST): Likewise.
	(ARM_HAVE_V2SI_LDST, ARM_HAVE_V4SI_LDST, ARM_HAVE_V4HF_LDST): Likewise.
	(ARM_HAVE_V8HF_LDST, ARM_HAVE_V4BF_LDST, ARM_HAVE_V8BF_LDST): Likewise.
	(ARM_HAVE_V2SF_LDST, ARM_HAVE_V4SF_LDST, ARM_HAVE_DI_LDST): Likewise.
	(ARM_HAVE_V2DI_LDST): Likewise.
	* config/arm/mve.md (*movmisalign<mode>_mve_store): New pattern.
	(*movmisalign<mode>_mve_load): New pattern.
	* config/arm/neon.md (movmisalign<mode>): Move to ...
	* config/arm/vec-common.md: ... here.

	PR target/97875
	gcc/testsuite/
	* gcc.target/arm/simd/mve-vneg.c: Update test.
nstester pushed a commit to nstester/gcc that referenced this pull request Apr 6, 2021
This patch fixes PR99748 which shows us trying to pass the argument to
__aeabi_f2iz in the VFP register s0 when the library function is
expecting to use the GPR r0. It also fixes the __aeabi_f2uiz case which
was broken in the same way.

For the testcase in the PR, here is the code we generate before the
patch (with -mfloat-abi=hard -march=armv8.1-m.main+mve -O0):

main:
    push    {r7, lr}
    sub     sp, sp, #8
    add     r7, sp, #0
    mov     r3, #1065353216
    str     r3, [r7, #4]    @ float
    vldr.32 s0, [r7, #4]
    bl      __aeabi_f2iz
    mov     r3, r0
    cmp     r3, #1
    [...]

This becomes:

main:
    push    {r7, lr}
    sub     sp, sp, #8
    add     r7, sp, #0
    mov     r3, #1065353216
    str     r3, [r7, #4]    @ float
    ldr     r0, [r7, #4]    @ float
    bl      __aeabi_f2iz
    mov     r3, r0
    cmp     r3, #1
    [...]

after the patch. We see a similar change for the same testcase with a
cast to unsigned instead of int.

gcc/ChangeLog:

	PR target/99748
	* config/arm/arm.c (arm_libcall_uses_aapcs_base): Also use base
	PCS for [su]fix_optab.
kraj pushed a commit to kraj/gcc that referenced this pull request Apr 23, 2021
This patch fixes PR99748 which shows us trying to pass the argument to
__aeabi_f2iz in the VFP register s0 when the library function is
expecting to use the GPR r0. It also fixes the __aeabi_f2uiz case which
was broken in the same way.

For the testcase in the PR, here is the code we generate before the
patch (with -mfloat-abi=hard -march=armv8.1-m.main+mve -O0):

main:
    push    {r7, lr}
    sub     sp, sp, #8
    add     r7, sp, #0
    mov     r3, #1065353216
    str     r3, [r7, #4]    @ float
    vldr.32 s0, [r7, #4]
    bl      __aeabi_f2iz
    mov     r3, r0
    cmp     r3, #1
    [...]

This becomes:

main:
    push    {r7, lr}
    sub     sp, sp, #8
    add     r7, sp, #0
    mov     r3, #1065353216
    str     r3, [r7, #4]    @ float
    ldr     r0, [r7, #4]    @ float
    bl      __aeabi_f2iz
    mov     r3, r0
    cmp     r3, #1
    [...]

after the patch. We see a similar change for the same testcase with a
cast to unsigned instead of int.

gcc/ChangeLog:

	PR target/99748
	* config/arm/arm.c (arm_libcall_uses_aapcs_base): Also use base
	PCS for [su]fix_optab.

(cherry picked from commit 16ea7f5)
nstester pushed a commit to nstester/gcc that referenced this pull request Jun 14, 2021
The fixed error is:

==21166==ERROR: AddressSanitizer: alloc-dealloc-mismatch (operator new [] vs operator delete) on 0x60300000d900
    #0 0x7367d7 in operator delete(void*, unsigned long) /home/marxin/BIG/buildbot/buildworker/marxinbox-gcc-asan/build/libsanitizer/asan/asan_new_delete.cpp:172
    #1 0x3b82e6e in pointer_equiv_analyzer::~pointer_equiv_analyzer() /home/marxin/BIG/buildbot/buildworker/marxinbox-gcc-asan/build/gcc/gimple-ssa-evrp.c:161
    #2 0x3b83387 in hybrid_folder::~hybrid_folder() /home/marxin/BIG/buildbot/buildworker/marxinbox-gcc-asan/build/gcc/gimple-ssa-evrp.c:517
    #3 0x3b83387 in execute_early_vrp /home/marxin/BIG/buildbot/buildworker/marxinbox-gcc-asan/build/gcc/gimple-ssa-evrp.c:686
    #4 0x1790611 in execute_one_pass(opt_pass*) /home/marxin/BIG/buildbot/buildworker/marxinbox-gcc-asan/build/gcc/passes.c:2567
    #5 0x1792003 in execute_pass_list_1 /home/marxin/BIG/buildbot/buildworker/marxinbox-gcc-asan/build/gcc/passes.c:2656
    #6 0x1792029 in execute_pass_list_1 /home/marxin/BIG/buildbot/buildworker/marxinbox-gcc-asan/build/gcc/passes.c:2657
    #7 0x179209f in execute_pass_list(function*, opt_pass*) /home/marxin/BIG/buildbot/buildworker/marxinbox-gcc-asan/build/gcc/passes.c:2667
    #8 0x178a5f3 in do_per_function_toporder(void (*)(function*, void*), void*) /home/marxin/BIG/buildbot/buildworker/marxinbox-gcc-asan/build/gcc/passes.c:1773
    #9 0x1792fac in do_per_function_toporder(void (*)(function*, void*), void*) /home/marxin/BIG/buildbot/buildworker/marxinbox-gcc-asan/build/gcc/plugin.h:191
    #10 0x1792fac in execute_ipa_pass_list(opt_pass*) /home/marxin/BIG/buildbot/buildworker/marxinbox-gcc-asan/build/gcc/passes.c:3001
    #11 0xc525fc in ipa_passes /home/marxin/BIG/buildbot/buildworker/marxinbox-gcc-asan/build/gcc/cgraphunit.c:2154
    #12 0xc525fc in symbol_table::compile() /home/marxin/BIG/buildbot/buildworker/marxinbox-gcc-asan/build/gcc/cgraphunit.c:2289
    #13 0xc5a096 in symbol_table::compile() /home/marxin/BIG/buildbot/buildworker/marxinbox-gcc-asan/build/gcc/cgraphunit.c:2269
    #14 0xc5a096 in symbol_table::finalize_compilation_unit() /home/marxin/BIG/buildbot/buildworker/marxinbox-gcc-asan/build/gcc/cgraphunit.c:2537
    #15 0x1a7a17c in compile_file /home/marxin/BIG/buildbot/buildworker/marxinbox-gcc-asan/build/gcc/toplev.c:482
    #16 0x69c758 in do_compile /home/marxin/BIG/buildbot/buildworker/marxinbox-gcc-asan/build/gcc/toplev.c:2210
    #17 0x69c758 in toplev::main(int, char**) /home/marxin/BIG/buildbot/buildworker/marxinbox-gcc-asan/build/gcc/toplev.c:2349
    #18 0x6a932a in main /home/marxin/BIG/buildbot/buildworker/marxinbox-gcc-asan/build/gcc/main.c:39
    #19 0x7ffff7820b34 in __libc_start_main ../csu/libc-start.c:332
    #20 0x6aa5fd in _start (/home/marxin/BIG/buildbot/buildworker/marxinbox-gcc-asan/objdir/gcc/cc1+0x6aa5fd)

0x60300000d900 is located 0 bytes inside of 32-byte region [0x60300000d900,0x60300000d920)
allocated by thread T0 here:
    #0 0x735ab7 in operator new[](unsigned long) /home/marxin/BIG/buildbot/buildworker/marxinbox-gcc-asan/build/libsanitizer/asan/asan_new_delete.cpp:102
    #1 0x3b82dac in pointer_equiv_analyzer::pointer_equiv_analyzer(gimple_ranger*) /home/marxin/BIG/buildbot/buildworker/marxinbox-gcc-asan/build/gcc/gimple-ssa-evrp.c:156

gcc/ChangeLog:

	* gimple-ssa-evrp.c (pointer_equiv_analyzer::~pointer_equiv_analyzer): Use delete[].
nstester pushed a commit to nstester/gcc that referenced this pull request Jun 14, 2021
This patch adds vec_unpack<US>_hi_<mode>, vec_unpack<US>_lo_<mode>,
vec_pack_trunc_<mode> patterns for MVE.

It does so by moving the unpack patterns from neon.md to
vec-common.md, while adding them support for MVE. The pack expander is
derived from the Neon one (which in turn is renamed into
neon_quad_vec_pack_trunc_<mode>).

The patch introduces mve_vec_unpack<US>_lo_<mode> and
mve_vec_unpack<US>_hi_<mode> which are similar to their Neon
counterparts, except for the assembly syntax.

The patch introduces mve_vec_pack_trunc_lo_<mode> to avoid the need for a
zero-initialized temporary, which is needed if the
vec_pack_trunc_<mode> expander calls @mve_vmovn[bt]q_<supf><mode>
instead.

With this patch, we can now vectorize the 16 and 8-bit versions of
vclz and vshl, although the generated code could still be improved.
For test_clz_s16, we now generate
        vldrh.16        q3, [r1]
        vmovlb.s16   q2, q3
        vmovlt.s16   q3, q3
        vclz.i32  q2, q2
        vclz.i32  q3, q3
        vmovnb.i32      q1, q2
        vmovnt.i32      q1, q3
        vstrh.16        q1, [r0]
which could be improved to
        vldrh.16        q3, [r1]
	vclz.i16	q1, q3
        vstrh.16        q1, [r0]
if we could avoid the need for unpack/pack steps.

For reference, clang-12 generates:
	vldrh.s32       q0, [r1]
	vldrh.s32       q1, [r1, #8]
	vclz.i32        q0, q0
	vstrh.32        q0, [r0]
	vclz.i32        q0, q1
	vstrh.32        q0, [r0, #8]

2021-06-11  Christophe Lyon  <christophe.lyon@linaro.org>

	gcc/
	* config/arm/mve.md (mve_vec_unpack<US>_lo_<mode>): New pattern.
	(mve_vec_unpack<US>_hi_<mode>): New pattern.
	(@mve_vec_pack_trunc_lo_<mode>): New pattern.
	(mve_vmovntq_<supf><mode>): Prefix with '@'.
	* config/arm/neon.md (vec_unpack<US>_hi_<mode>): Move to
	vec-common.md.
	(vec_unpack<US>_lo_<mode>): Likewise.
	(vec_pack_trunc_<mode>): Rename to
	neon_quad_vec_pack_trunc_<mode>.
	* config/arm/vec-common.md (vec_unpack<US>_hi_<mode>): New
	pattern.
	(vec_unpack<US>_lo_<mode>): New.
	(vec_pack_trunc_<mode>): New.

	gcc/testsuite/
	* gcc.target/arm/simd/mve-vclz.c: Update expected results.
	* gcc.target/arm/simd/mve-vshl.c: Likewise.
	* gcc.target/arm/simd/mve-vec-pack.c: New test.
	* gcc.target/arm/simd/mve-vec-unpack.c: New test.
nstester pushed a commit to nstester/gcc that referenced this pull request Sep 13, 2021
The current restriction on folding memcpy to a single element of size
MOVE_MAX is excessively cautious on most machines and limits some
significant further optimizations.  So relax the restriction provided
the copy size does not exceed MOVE_MAX * MOVE_RATIO and that a SET
insn exists for moving the value into machine registers.

Note that there were already checks in place for having misaligned
move operations when one or more of the operands were unaligned.

On Arm this now permits optimizing

uint64_t bar64(const uint8_t *rData1)
{
    uint64_t buffer;
    memcpy(&buffer, rData1, sizeof(buffer));
    return buffer;
}

from
        ldr     r2, [r0]        @ unaligned
        sub     sp, sp, #8
        ldr     r3, [r0, #4]    @ unaligned
        strd    r2, [sp]
        ldrd    r0, [sp]
        add     sp, sp, #8

to
        mov     r3, r0
        ldr     r0, [r0]        @ unaligned
        ldr     r1, [r3, #4]    @ unaligned

PR target/102125 - (ARM Cortex-M3 and newer) missed optimization. memcpy not needed operations

gcc/ChangeLog:

	PR target/102125
	* gimple-fold.c (gimple_fold_builtin_memory_op): Allow folding
	memcpy if the size is not more than MOVE_MAX * MOVE_RATIO.
nstester pushed a commit to nstester/gcc that referenced this pull request Nov 26, 2021
Fixes:

==129444==ERROR: AddressSanitizer: global-buffer-overflow on address 0x00000666ca5c at pc 0x000000ef094b bp 0x7fffffff8180 sp 0x7fffffff8178
READ of size 4 at 0x00000666ca5c thread T0
    #0 0xef094a in parse_optimize_options ../../gcc/d/d-attribs.cc:855
    #1 0xef0d36 in d_handle_optimize_attribute ../../gcc/d/d-attribs.cc:916
    #2 0xef107e in d_handle_optimize_attribute ../../gcc/d/d-attribs.cc:887
    #3 0xff85b1 in decl_attributes(tree_node**, tree_node*, int, tree_node*) ../../gcc/attribs.c:829
    #4 0xef2a91 in apply_user_attributes(Dsymbol*, tree_node*) ../../gcc/d/d-attribs.cc:427
    #5 0xf7b7f3 in get_symbol_decl(Declaration*) ../../gcc/d/decl.cc:1346
    #6 0xf87bc7 in get_symbol_decl(Declaration*) ../../gcc/d/decl.cc:967
    #7 0xf87bc7 in DeclVisitor::visit(FuncDeclaration*) ../../gcc/d/decl.cc:808
    #8 0xf83db5 in DeclVisitor::build_dsymbol(Dsymbol*) ../../gcc/d/decl.cc:146

for the following test-case: gcc/testsuite/gdc.dg/attr_optimize1.d.

gcc/d/ChangeLog:

	* d-attribs.cc (parse_optimize_options): Check index before
	accessing cl_options.
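
A tiny generic C++ sketch of this defect class and the shape of the fix (the table and names below are hypothetical, not GCC's real cl_options): an index obtained elsewhere must be range-checked before it is used to subscript a fixed-size array.

// Hypothetical illustration of bounds-checking an index before table access.
#include <cstdio>

struct option_desc { const char *name; };

// Assumed stand-in for the real options table.
static const option_desc cl_options_sketch[] = { {"-O0"}, {"-O1"}, {"-O2"} };
static const unsigned n_opts
  = sizeof (cl_options_sketch) / sizeof (cl_options_sketch[0]);

static const char *
lookup (unsigned idx)
{
  if (idx >= n_opts)            // the added guard: reject out-of-range indices
    return "(unknown option)";
  return cl_options_sketch[idx].name;
}

int main ()
{
  std::printf ("%s\n", lookup (2));   // in range
  std::printf ("%s\n", lookup (7));   // out of range; previously an overflowing read
  return 0;
}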
kraj pushed a commit to kraj/gcc that referenced this pull request Nov 26, 2021
Fixes:

==129444==ERROR: AddressSanitizer: global-buffer-overflow on address 0x00000666ca5c at pc 0x000000ef094b bp 0x7fffffff8180 sp 0x7fffffff8178
READ of size 4 at 0x00000666ca5c thread T0
    #0 0xef094a in parse_optimize_options ../../gcc/d/d-attribs.cc:855
    #1 0xef0d36 in d_handle_optimize_attribute ../../gcc/d/d-attribs.cc:916
    #2 0xef107e in d_handle_optimize_attribute ../../gcc/d/d-attribs.cc:887
    #3 0xff85b1 in decl_attributes(tree_node**, tree_node*, int, tree_node*) ../../gcc/attribs.c:829
    #4 0xef2a91 in apply_user_attributes(Dsymbol*, tree_node*) ../../gcc/d/d-attribs.cc:427
    #5 0xf7b7f3 in get_symbol_decl(Declaration*) ../../gcc/d/decl.cc:1346
    #6 0xf87bc7 in get_symbol_decl(Declaration*) ../../gcc/d/decl.cc:967
    #7 0xf87bc7 in DeclVisitor::visit(FuncDeclaration*) ../../gcc/d/decl.cc:808
    #8 0xf83db5 in DeclVisitor::build_dsymbol(Dsymbol*) ../../gcc/d/decl.cc:146

for the following test-case: gcc/testsuite/gdc.dg/attr_optimize1.d.

gcc/d/ChangeLog:

	* d-attribs.cc (parse_optimize_options): Check index before
	accessing cl_options.
nstester pushed a commit to nstester/gcc that referenced this pull request Dec 30, 2021
…imize or target pragmas [PR103012]

The following testcases ICE when an optimize or target pragma
is followed by a long line (4096+ chars).
This is because on such long lines we can't use columns anymore,
but the cpp_define calls performed by c_cpp_builtins_optimize_pragma
or from the backend hooks for target pragma are done on temporary
buffers and expect to get columns from whatever line they appear on
(which happens to be the long line after optimize/target pragma),
and we run into:
 #0  fancy_abort (file=0x3abec67 "../../libcpp/line-map.c", line=502, function=0x3abecfc "linemap_add") at ../../gcc/diagnostic.c:1986
 #1  0x0000000002e7c335 in linemap_add (set=0x7ffff7fca000, reason=LC_RENAME, sysp=0, to_file=0x41287a0 "pr103012.i", to_line=3) at ../../libcpp/line-map.c:502
 #2  0x0000000002e7cc24 in linemap_line_start (set=0x7ffff7fca000, to_line=3, max_column_hint=128) at ../../libcpp/line-map.c:827
 #3  0x0000000002e7ce2b in linemap_position_for_column (set=0x7ffff7fca000, to_column=1) at ../../libcpp/line-map.c:898
 #4  0x0000000002e771f9 in _cpp_lex_direct (pfile=0x40c3b60) at ../../libcpp/lex.c:3592
 #5  0x0000000002e76c3e in _cpp_lex_token (pfile=0x40c3b60) at ../../libcpp/lex.c:3394
 #6  0x0000000002e610ef in lex_macro_node (pfile=0x40c3b60, is_def_or_undef=true) at ../../libcpp/directives.c:601
 #7  0x0000000002e61226 in do_define (pfile=0x40c3b60) at ../../libcpp/directives.c:639
 #8  0x0000000002e610b2 in run_directive (pfile=0x40c3b60, dir_no=0, buf=0x7fffffffd430 "__OPTIMIZE__ 1\n", count=14) at ../../libcpp/directives.c:589
 #9  0x0000000002e650c1 in cpp_define (pfile=0x40c3b60, str=0x2f784d1 "__OPTIMIZE__") at ../../libcpp/directives.c:2513
 #10 0x0000000002e65100 in cpp_define_unused (pfile=0x40c3b60, str=0x2f784d1 "__OPTIMIZE__") at ../../libcpp/directives.c:2522
 #11 0x0000000000f50685 in c_cpp_builtins_optimize_pragma (pfile=0x40c3b60, prev_tree=<optimization_node 0x7fffea042000>, cur_tree=<optimization_node 0x7fffea042020>)
     at ../../gcc/c-family/c-cppbuiltin.c:600
assertion that LC_RENAME doesn't happen first.

I think the right fix is emit those predefined macros upon
optimize/target pragmas with BUILTINS_LOCATION, like we already do
for those macros at the start of the TU, they don't appear in columns
of the next line after it.  Another possibility would be to force them
at the location of the pragma.

2021-12-30  Jakub Jelinek  <jakub@redhat.com>

	PR c++/103012
gcc/
	* config/i386/i386-c.c (ix86_pragma_target_parse): Perform
	cpp_define/cpp_undef calls with forced token locations
	BUILTINS_LOCATION.
	* config/arm/arm-c.c (arm_pragma_target_parse): Likewise.
	* config/aarch64/aarch64-c.c (aarch64_pragma_target_parse): Likewise.
	* config/s390/s390-c.c (s390_pragma_target_parse): Likewise.
gcc/c-family/
	* c-cppbuiltin.c (c_cpp_builtins_optimize_pragma): Perform
	cpp_define_unused/cpp_undef calls with forced token locations
	BUILTINS_LOCATION.
gcc/testsuite/
	PR c++/103012
	* g++.dg/cpp/pr103012.C: New test.
	* g++.target/i386/pr103012.C: New test.
kraj pushed a commit to kraj/gcc that referenced this pull request Jan 24, 2022
…imize or target pragmas [PR103012]

The following testcases ICE when an optimize or target pragma
is followed by a long line (4096+ chars).
This is because on such long lines we can't use columns anymore,
but the cpp_define calls performed by c_cpp_builtins_optimize_pragma
or from the backend hooks for target pragma are done on temporary
buffers and expect to get columns from whatever line they appear on
(which happens to be the long line after optimize/target pragma),
and we run into:
 #0  fancy_abort (file=0x3abec67 "../../libcpp/line-map.c", line=502, function=0x3abecfc "linemap_add") at ../../gcc/diagnostic.c:1986
 #1  0x0000000002e7c335 in linemap_add (set=0x7ffff7fca000, reason=LC_RENAME, sysp=0, to_file=0x41287a0 "pr103012.i", to_line=3) at ../../libcpp/line-map.c:502
 #2  0x0000000002e7cc24 in linemap_line_start (set=0x7ffff7fca000, to_line=3, max_column_hint=128) at ../../libcpp/line-map.c:827
 #3  0x0000000002e7ce2b in linemap_position_for_column (set=0x7ffff7fca000, to_column=1) at ../../libcpp/line-map.c:898
 #4  0x0000000002e771f9 in _cpp_lex_direct (pfile=0x40c3b60) at ../../libcpp/lex.c:3592
 #5  0x0000000002e76c3e in _cpp_lex_token (pfile=0x40c3b60) at ../../libcpp/lex.c:3394
 #6  0x0000000002e610ef in lex_macro_node (pfile=0x40c3b60, is_def_or_undef=true) at ../../libcpp/directives.c:601
 #7  0x0000000002e61226 in do_define (pfile=0x40c3b60) at ../../libcpp/directives.c:639
 #8  0x0000000002e610b2 in run_directive (pfile=0x40c3b60, dir_no=0, buf=0x7fffffffd430 "__OPTIMIZE__ 1\n", count=14) at ../../libcpp/directives.c:589
 #9  0x0000000002e650c1 in cpp_define (pfile=0x40c3b60, str=0x2f784d1 "__OPTIMIZE__") at ../../libcpp/directives.c:2513
 #10 0x0000000002e65100 in cpp_define_unused (pfile=0x40c3b60, str=0x2f784d1 "__OPTIMIZE__") at ../../libcpp/directives.c:2522
 #11 0x0000000000f50685 in c_cpp_builtins_optimize_pragma (pfile=0x40c3b60, prev_tree=<optimization_node 0x7fffea042000>, cur_tree=<optimization_node 0x7fffea042020>)
     at ../../gcc/c-family/c-cppbuiltin.c:600
assertion that LC_RENAME doesn't happen first.

I think the right fix is emit those predefined macros upon
optimize/target pragmas with BUILTINS_LOCATION, like we already do
for those macros at the start of the TU, they don't appear in columns
of the next line after it.  Another possibility would be to force them
at the location of the pragma.

2021-12-30  Jakub Jelinek  <jakub@redhat.com>

	PR c++/103012
gcc/
	* config/i386/i386-c.c (ix86_pragma_target_parse): Perform
	cpp_define/cpp_undef calls with forced token locations
	BUILTINS_LOCATION.
	* config/arm/arm-c.c (arm_pragma_target_parse): Likewise.
	* config/aarch64/aarch64-c.c (aarch64_pragma_target_parse): Likewise.
	* config/s390/s390-c.c (s390_pragma_target_parse): Likewise.
gcc/c-family/
	* c-cppbuiltin.c (c_cpp_builtins_optimize_pragma): Perform
	cpp_define_unused/cpp_undef calls with forced token locations
	BUILTINS_LOCATION.
gcc/testsuite/
	PR c++/103012
	* g++.dg/cpp/pr103012.C: New test.
	* g++.target/i386/pr103012.C: New test.

(cherry picked from commit 1dbe26b)
vathpela pushed a commit to vathpela/gcc that referenced this pull request Feb 22, 2022
…04617]

On
 #define A(n) int foo1##n(void) { return 1##n; }
 #define B(n) A(n##0) A(n##1) A(n##2) A(n##3) A(n##4) A(n##5) A(n##6) A(n##7) A(n##8) A(n##9)
 #define C(n) B(n##0) B(n##1) B(n##2) B(n##3) B(n##4) B(n##5) B(n##6) B(n##7) B(n##8) B(n##9)
 #define D(n) C(n##0) C(n##1) C(n##2) C(n##3) C(n##4) C(n##5) C(n##6) C(n##7) C(n##8) C(n##9)
 #define E(n) D(n##0) D(n##1) D(n##2) D(n##3) D(n##4) D(n##5) D(n##6) D(n##7) D(n##8) D(n##9)
 E(0) E(1) E(2) D(30) D(31) C(320) C(321) C(322) C(323) C(324) C(325)
 B(3260) B(3261) B(3262) B(3263) A(32640) A(32641) A(32642)
testcase with
./xgcc -B ./ -c -g -fpic -ffat-lto-objects -flto  -O0 -o foo1.o foo1.c -ffunction-sections
./xgcc -B ./ -shared -g -fpic -flto -O0 -o foo1.so foo1.o
/tmp/ccTW8mBm.debug.temp.o: file not recognized: file format not recognized
(testcase too slow to be included into testsuite).
The problem is clearly reported by readelf:
readelf: foo1.o.debug.temp.o: Warning: Section 2 has an out of range sh_link value of 65321
readelf: foo1.o.debug.temp.o: Warning: Section 5 has an out of range sh_link value of 65321
readelf: foo1.o.debug.temp.o: Warning: Section 10 has an out of range sh_link value of 65323
readelf: foo1.o.debug.temp.o: Warning: [ 2]: Link field (65321) should index a symtab section.
readelf: foo1.o.debug.temp.o: Warning: [ 5]: Link field (65321) should index a symtab section.
readelf: foo1.o.debug.temp.o: Warning: [10]: Link field (65323) should index a string section.
because simple_object_elf_copy_lto_debug_sections doesn't adjust sh_info and
sh_link fields in ElfNN_Shdr if they are in between SHN_{LO,HI}RESERVE
inclusive.  Not adjusting those is incorrect though: the SHN_{LO,HI}RESERVE
range is only relevant to the 16-bit fields, mainly st_shndx in ElfNN_Sym,
where if one needs a section number >= SHN_LORESERVE, SHN_XINDEX should be
used instead and the .symtab_shndx section should contain the real section
index, and the ElfNN_Ehdr e_shnum and e_shstrndx fields, where if a value
>= SHN_LORESERVE is needed it should be put into Shdr[0].sh_{size,link}.
But sh_{link,info} are 32-bit fields which can contain any section index.
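
To make the 16-bit vs. 32-bit distinction concrete, here is a minimal
sketch using the standard elf.h definitions (the helper name and
parameters are illustrative, not libiberty code):

#include <elf.h>
#include <stddef.h>

/* st_shndx is only 16 bits, so a real section index >= SHN_LORESERVE
   must be escaped via SHN_XINDEX and stored in the .symtab_shndx table.
   32-bit fields like sh_link/sh_info need no such escape.  */
static unsigned int
sym_section_index (const Elf64_Sym *sym, const Elf32_Word *shndx_table,
                   size_t symidx)
{
  if (sym->st_shndx == SHN_XINDEX)
    return shndx_table[symidx];   /* real index stored out of line */
  return sym->st_shndx;           /* ordinary or special (SHN_ABS etc.) */
}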

Note, as simple-object-elf.c mentions, binutils from 2.12 to 2.18 (so before
2011) used to mishandle the > 63.75K sections case and assumed there is a
hole in between the sections; but what
simple_object_elf_copy_lto_debug_sections does wouldn't help in that case
for the debug temp object creation, and we'd need to detect the case also
in that routine and take it into account in the remapping etc.  I think it
is not worth it given that those binutils are over 10 years old; if
somebody needs 63.75K or more sections, they had better use more recent
binutils.

2022-02-22  Jakub Jelinek  <jakub@redhat.com>

	PR lto/104617
	* simple-object-elf.c (simple_object_elf_match): Fix up URL
	in comment.
	(simple_object_elf_copy_lto_debug_sections): Remap sh_info and
	sh_link even if they are in the SHN_LORESERVE .. SHN_HIRESERVE
	range (inclusive).
kraj pushed a commit to kraj/gcc that referenced this pull request May 10, 2022
kraj pushed a commit to kraj/gcc that referenced this pull request May 11, 2022
xionghul pushed a commit to xionghul/gcc that referenced this pull request Jul 18, 2022
This patch extends the fix for PR106253 to AArch32.  As with AArch64,
we were using ACLE intrinsics to vectorise scalar built-ins, even
though the two sometimes have different ECF_* flags.  (That in turn
is because the ACLE intrinsics should follow the instruction semantics
as closely as possible, whereas the scalar built-ins follow language
specs.)

The patch also removes the copysignf built-in, which only existed
for this purpose and wasn't a “real” arm_neon.h built-in.

Doing this also has the side-effect of enabling vectorisation of
rint and roundeven.  Logically that should be a separate patch,
but making it one would have meant adding a new int iterator
for the original set of instructions and then removing it again
when including new functions.
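
As a usage sketch (my example, not from the patch), a loop of the
following shape can now be vectorised on AArch32, with each rintf
mapping onto a NEON vrint* instruction per vector lane:

#include <math.h>

void round_all (float *x, int n)
{
  for (int i = 0; i < n; i++)
    x[i] = rintf (x[i]);
}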

I've restricted the bswap tests to little-endian because we end
up with excessive spilling on big-endian.  E.g.:

        sub     sp, sp, #8
        vstr    d1, [sp]
        vldr    d16, [sp]
        vrev16.8        d16, d16
        vstr    d16, [sp]
        vldr    d0, [sp]
        add     sp, sp, #8
        @ sp needed
        bx      lr

Similarly, the copysign tests require little-endian because on
big-endian we unnecessarily load the constant from the constant pool:

        vldr.32 s15, .L3
        vdup.32 d0, d7[1]
        vbsl    d0, d2, d1
        bx      lr
.L3:
        .word   -2147483648

gcc/
	PR target/106253
	* config/arm/arm-builtins.cc (arm_builtin_vectorized_function):
	Delete.
	* config/arm/arm-protos.h (arm_builtin_vectorized_function): Delete.
	* config/arm/arm.cc (TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION):
	Delete.
	* config/arm/arm_neon_builtins.def (copysignf): Delete.
	* config/arm/iterators.md (nvrint_pattern): New attribute.
	* config/arm/neon.md (<NEON_VRINT:nvrint_pattern><VCVTF:mode>2):
	New pattern.
	(l<NEON_VCVT:nvrint_pattern><su_optab><VCVTF:mode><v_cmp_result>2):
	Likewise.
	(neon_copysignf<mode>): Rename to...
	(copysign<mode>3): ...this.

gcc/testsuite/
	PR target/106253
	* gcc.target/arm/vect_unary_1.c: New test.
	* gcc.target/arm/vect_binary_1.c: Likewise.
xionghul pushed a commit to xionghul/gcc that referenced this pull request Aug 30, 2022
Currently SLP tries to force permute operations "down" the graph
from loads in the hope of reducing the total number of permutations
needed or (in the best case) removing the need for the permutations
entirely.  This patch tries to extend it as follows:

- Allow loads to take a different permutation from the one they
  started with, rather than choosing between "original permutation"
  and "no permutation".

- Allow changes in both directions, if the target supports the
  reverse permutation.

- Treat the placement of permutations as a two-way dataflow problem:
  after propagating information from leaves to roots (as now), propagate
  information back up the graph.

- Take execution frequency into account when optimising for speed,
  so that (for example) permutations inside loops have a higher
  cost than permutations outside loops.

- Try to reduce the total number of permutations when optimising for
  size, even if that increases the number of permutations on a given
  execution path.

See the big block comment above vect_optimize_slp_pass for
a detailed description.

The original motivation for doing this was to add a framework that would
allow other layout differences in future.  The two main ones are:

- Make it easier to represent predicated operations, including
  predicated operations with gaps.  E.g.:

     a[0] += 1;
     a[1] += 1;
     a[3] += 1;

  could be a single load/add/store for SVE.  We could handle this
  by representing a layout such as { 0, 1, _, 2 } or { 0, 1, _, 3 }
  (depending on what's being counted).  We might need to move
  elements between lanes at various points, like with permutes.

  (This would first mean adding support for stores with gaps.)

- Make it easier to switch between an even/odd and unpermuted layout
  when switching between wide and narrow elements.  E.g. if a widening
  operation produces an even vector and an odd vector, we should try
  to keep operations on the wide elements in that order rather than
  force them to be permuted back "in order".

To give some examples of what the patch does:

int f1(int *__restrict a, int *__restrict b, int *__restrict c,
       int *__restrict d)
{
  a[0] = (b[1] << c[3]) - d[1];
  a[1] = (b[0] << c[2]) - d[0];
  a[2] = (b[3] << c[1]) - d[3];
  a[3] = (b[2] << c[0]) - d[2];
}

continues to produce the same code as before when optimising for
speed: b, c and d are permuted at load time.  But when optimising
for size we instead permute c into the same order as b+d and then
permute the result of the arithmetic into the same order as a:

        ldr     q1, [x2]
        ldr     q0, [x1]
        ext     v1.16b, v1.16b, v1.16b, #8     // <------
        sshl    v0.4s, v0.4s, v1.4s
        ldr     q1, [x3]
        sub     v0.4s, v0.4s, v1.4s
        rev64   v0.4s, v0.4s                   // <------
        str     q0, [x0]
        ret

The following function:

int f2(int *__restrict a, int *__restrict b, int *__restrict c,
       int *__restrict d)
{
  a[0] = (b[3] << c[3]) - d[3];
  a[1] = (b[2] << c[2]) - d[2];
  a[2] = (b[1] << c[1]) - d[1];
  a[3] = (b[0] << c[0]) - d[0];
}

continues to push the reverse down to just before the store,
like the previous code did.

In:

int f3(int *__restrict a, int *__restrict b, int *__restrict c,
       int *__restrict d)
{
  for (int i = 0; i < 100; ++i)
    {
      a[0] = (a[0] + c[3]);
      a[1] = (a[1] + c[2]);
      a[2] = (a[2] + c[1]);
      a[3] = (a[3] + c[0]);
      c += 4;
    }
}

the loads of a are hoisted and the stores of a are sunk, so that
only the load from c happens in the loop.  When optimising for
speed, we prefer to have the loop operate on the reversed layout,
changing on entry and exit from the loop:

        mov     x3, x0
        adrp    x0, .LC0
        add     x1, x2, 1600
        ldr     q2, [x0, #:lo12:.LC0]
        ldr     q0, [x3]
        mov     v1.16b, v0.16b
        tbl     v0.16b, {v0.16b - v1.16b}, v2.16b    // <--------
        .p2align 3,,7
.L6:
        ldr     q1, [x2], 16
        add     v0.4s, v0.4s, v1.4s
        cmp     x2, x1
        bne     .L6
        mov     v1.16b, v0.16b
        adrp    x0, .LC0
        ldr     q2, [x0, #:lo12:.LC0]
        tbl     v0.16b, {v0.16b - v1.16b}, v2.16b    // <--------
        str     q0, [x3]
        ret

Similarly, for the very artificial testcase:

int f4(int *__restrict a, int *__restrict b, int *__restrict c,
       int *__restrict d)
{
  int a0 = a[0];
  int a1 = a[1];
  int a2 = a[2];
  int a3 = a[3];
  for (int i = 0; i < 100; ++i)
    {
      a0 ^= c[0];
      a1 ^= c[1];
      a2 ^= c[2];
      a3 ^= c[3];
      c += 4;
      for (int j = 0; j < 100; ++j)
	{
	  a0 += d[1];
	  a1 += d[0];
	  a2 += d[3];
	  a3 += d[2];
	  d += 4;
	}
      b[0] = a0;
      b[1] = a1;
      b[2] = a2;
      b[3] = a3;
      b += 4;
    }
  a[0] = a0;
  a[1] = a1;
  a[2] = a2;
  a[3] = a3;
}

the a vector in the inner loop maintains the order { 1, 0, 3, 2 },
even though it's part of an SCC that includes the outer loop.
In other words, this is a motivating case for not assigning
permutes at SCC granularity.  The code we get is:

        ldr     q0, [x0]
        mov     x4, x1
        mov     x5, x0
        add     x1, x3, 1600
        add     x3, x4, 1600
        .p2align 3,,7
.L11:
        ldr     q1, [x2], 16
        sub     x0, x1, #1600
        eor     v0.16b, v1.16b, v0.16b
        rev64   v0.4s, v0.4s              // <---
        .p2align 3,,7
.L10:
        ldr     q1, [x0], 16
        add     v0.4s, v0.4s, v1.4s
        cmp     x0, x1
        bne     .L10
        rev64   v0.4s, v0.4s              // <---
        add     x1, x0, 1600
        str     q0, [x4], 16
        cmp     x3, x4
        bne     .L11
        str     q0, [x5]
        ret

bb-slp-layout-17.c is a collection of compile tests for problems
I hit with earlier versions of the patch.  The same problems might
show up elsewhere, but it seemed worth having the test anyway.

In slp-11b.c we previously pushed the permutation of the in[i*4]
group down from the load to just before the store.  That didn't
reduce the number or frequency of the permutations (or increase
them either).  But separating the permute from the load meant
that we could no longer use load/store lanes.

Whether load/store lanes are a good idea here is another question.
If there were two sets of loads, and if we could use a single
permutation instead of one per load, then avoiding load/store
lanes should be a good thing even under the current abstract
cost model.  But I think under the current model we should
try to avoid splitting up potential load/store lanes groups
if there is no specific benefit to the split.

Preferring load/store lanes is still a source of missed optimisations
that we should fix one day...

gcc/
	* params.opt (-param=vect-max-layout-candidates=): New parameter.
	* doc/invoke.texi (vect-max-layout-candidates): Document it.
	* tree-vectorizer.h (auto_lane_permutation_t): New typedef.
	(auto_load_permutation_t): Likewise.
	* tree-vect-slp.cc (vect_slp_node_weight): New function.
	(slpg_layout_cost): New class.
	(slpg_vertex): Replace perm_in and perm_out with partition,
	out_degree, weight and out_weight.
	(slpg_partition_info, slpg_partition_layout_costs): New classes.
	(vect_optimize_slp_pass): Likewise, cannibalizing some part of
	the previous vect_optimize_slp.
	(vect_optimize_slp): Use it.

gcc/testsuite/
	* lib/target-supports.exp (check_effective_target_vect_var_shift):
	Return true for aarch64.
	* gcc.dg/vect/bb-slp-layout-1.c: New test.
	* gcc.dg/vect/bb-slp-layout-2.c: New test.
	* gcc.dg/vect/bb-slp-layout-3.c: New test.
	* gcc.dg/vect/bb-slp-layout-4.c: New test.
	* gcc.dg/vect/bb-slp-layout-5.c: New test.
	* gcc.dg/vect/bb-slp-layout-6.c: New test.
	* gcc.dg/vect/bb-slp-layout-7.c: New test.
	* gcc.dg/vect/bb-slp-layout-8.c: New test.
	* gcc.dg/vect/bb-slp-layout-9.c: New test.
	* gcc.dg/vect/bb-slp-layout-10.c: New test.
	* gcc.dg/vect/bb-slp-layout-11.c: New test.
	* gcc.dg/vect/bb-slp-layout-13.c: New test.
	* gcc.dg/vect/bb-slp-layout-14.c: New test.
	* gcc.dg/vect/bb-slp-layout-15.c: New test.
	* gcc.dg/vect/bb-slp-layout-16.c: New test.
	* gcc.dg/vect/bb-slp-layout-17.c: New test.
	* gcc.dg/vect/slp-11b.c: XFAIL SLP test for load-lanes targets.
xionghul pushed a commit to xionghul/gcc that referenced this pull request Jan 28, 2023
The aarch64 ISA specification allows a left shift amount to be applied
after extension in the range of 0 to 4 (encoded in the imm3 field).

This is true for at least the following instructions:

 * ADD (extend register)
 * ADDS (extended register)
 * SUB (extended register)

The result of this patch can be seen when compiling the following code:

uint64_t myadd(uint64_t a, uint64_t b)
{
    return a+(((uint8_t)b)<<4);
}

Without the patch the following sequence will be generated:

0000000000000000 <myadd>:
   0:	d37c1c21 	ubfiz	x1, x1, #4, #8
   4:	8b000020 	add	x0, x1, x0
   8:	d65f03c0 	ret

With the patch the ubfiz will be merged into the add instruction:

0000000000000000 <myadd>:
   0:	8b211000 	add	x0, x0, w1, uxtb #4
   4:	d65f03c0 	ret

gcc/ChangeLog:

	* config/aarch64/aarch64.cc (aarch64_uxt_size): fix an
	off-by-one in checking the permissible shift-amount.
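
The fix is a one-token change to the range check; a hedged sketch of the
corrected guard (the helper is illustrative and the old bound is an
assumption; the real code sits inside aarch64_uxt_size):

/* The imm3 field encodes a post-extension left shift of 0..4, so the
   guard must accept shift == 4 as well.  */
static int
uxt_shift_ok (int shift)
{
  return shift >= 0 && shift <= 4;   /* the old check stopped one short */
}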
xionghul pushed a commit to xionghul/gcc that referenced this pull request Mar 12, 2023
…hook [PR108583]

This replaces the custom division hook with just an implementation through
add_highpart.  For NEON we implement the add highpart (Addition + extraction of
the upper highpart of the register in the same precision) as ADD + LSR.

This representation allows us to easily optimize the sequence using existing
sequences. This gets us a pretty decent sequence using SRA:

        umull   v1.8h, v0.8b, v3.8b
        umull2  v0.8h, v0.16b, v3.16b
        add     v5.8h, v1.8h, v2.8h
        add     v4.8h, v0.8h, v2.8h
        usra    v1.8h, v5.8h, 8
        usra    v0.8h, v4.8h, 8
        uzp2    v1.16b, v1.16b, v0.16b
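
The add-highpart building block itself is just the following scalar
idiom, shown as a hedged C sketch (the helper name is mine; the patch
expresses this in RTL as ADD + LSR):

#include <stdint.h>

/* Add two values and keep the upper half of the sum in the same
   precision: for 16-bit elements this is the ADD + LSR #8 pair.  */
static inline uint16_t
add_highpart_u16 (uint16_t a, uint16_t b)
{
  return (uint16_t) ((a + b) >> 8);   /* upper bits, carry included */
}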

To get the most optimal sequence, however, we match (a + ((b + c) >> n)),
where n is half the precision of the mode of the operation, into
addhn + uaddw, which is a generally good optimization on its own and gets
us back to:

.L4:
        ldr     q0, [x3]
        umull   v1.8h, v0.8b, v5.8b
        umull2  v0.8h, v0.16b, v5.16b
        addhn   v3.8b, v1.8h, v4.8h
        addhn   v2.8b, v0.8h, v4.8h
        uaddw   v1.8h, v1.8h, v3.8b
        uaddw   v0.8h, v0.8h, v2.8b
        uzp2    v1.16b, v1.16b, v0.16b
        str     q1, [x3], 16
        cmp     x3, x4
        bne     .L4

For SVE2 we optimize the initial sequence to the same ADD + LSR which gets us:

.L3:
        ld1b    z0.h, p0/z, [x0, x3]
        mul     z0.h, p1/m, z0.h, z2.h
        add     z1.h, z0.h, z3.h
        usra    z0.h, z1.h, #8
        lsr     z0.h, z0.h, #8
        st1b    z0.h, p0, [x0, x3]
        inch    x3
        whilelo p0.h, w3, w2
        b.any   .L3
.L1:
        ret

and to get the most optimal sequence I match (a + b) >> n (same constraint on n)
to addhnb which gets us to:

.L3:
        ld1b    z0.h, p0/z, [x0, x3]
        mul     z0.h, p1/m, z0.h, z2.h
        addhnb  z1.b, z0.h, z3.h
        addhnb  z0.b, z0.h, z1.h
        st1b    z0.h, p0, [x0, x3]
        inch    x3
        whilelo p0.h, w3, w2
        b.any   .L3

There are multiple possible RTL representations for these optimizations;
I did not represent them using a zero_extend because we seem very
inconsistent about this in the backend.  Since they are unspecs we won't
match them from vector ops anyway.  I figured maintainers would prefer
this, but my maintainer ouija board is still out for repairs :)

There are no new tests, as correctness tests were added to the mid-end
and codegen tests for this already exist.

gcc/ChangeLog:

	PR target/108583
	* config/aarch64/aarch64-simd.md (@aarch64_bitmask_udiv<mode>3): Remove.
	(*bitmask_shift_plus<mode>): New.
	* config/aarch64/aarch64-sve2.md (*bitmask_shift_plus<mode>): New.
	(@aarch64_bitmask_udiv<mode>3): Remove.
	* config/aarch64/aarch64.cc
	(aarch64_vectorize_can_special_div_by_constant,
	TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST): Removed.
	(TARGET_VECTORIZE_PREFERRED_DIV_AS_SHIFTS_OVER_MULT,
	aarch64_vectorize_preferred_div_as_shifts_over_mult): New.
nstester pushed a commit to nstester/gcc that referenced this pull request Apr 26, 2023
This patch adds support for xstormy16's swpb (swap bytes) and swpw (swap
words) instructions.  The most obvious application of these is to implement
the __builtin_bswap16 and __builtin_bswap32 intrinsics.

Currently, __builtin_bswap16 is implemented as:
foo:    mov r7,r2
        shl     r7,#8
        shr     r2,#8
        or r2,r7
        ret

but with this patch becomes:
foo:	swpb r2
	ret

Likewise, __builtin_bswap32 now becomes:
foo:	swpb r2 | swpb r3 | swpw r2,r3
        ret
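
For reference, a portable C sketch of what the two expansions compute
(my illustration, not the GCC expansion itself); the bswap32 body
mirrors the swpb/swpb/swpw sequence on the r2:r3 register pair:

#include <stdint.h>

static inline uint16_t bswap16 (uint16_t x)
{
  return (uint16_t) ((x << 8) | (x >> 8));        /* swpb */
}

static inline uint32_t bswap32 (uint32_t x)
{
  /* Swap the bytes within each 16-bit half, then exchange the halves.  */
  uint32_t lo = bswap16 ((uint16_t) x);           /* swpb r2 */
  uint32_t hi = bswap16 ((uint16_t) (x >> 16));   /* swpb r3 */
  return (lo << 16) | hi;                         /* swpw r2,r3 */
}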

Finally, the swpw instruction on its own can be used to exchange
two word mode registers without a temporary, so a new pattern and
peephole2 have been added to catch this.  As described in the
PR rtl-optimization/106518, register allocation can (in theory)
be more efficient on targets that provide a swap/exchange instruction.
The slightly unusual swap<mode> naming matches that used in i386.md.

2024-04-26  Roger Sayle  <roger@nextmovesoftware.com>

gcc/ChangeLog
	* config/stormy16/stormy16.md (bswaphi2): New define_insn.
	(bswapsi2): New define_insn.
	(swaphi): New define_insn to exchange two registers (swpw).
	(define_peephole2): Recognize exchange of registers as swaphi.

gcc/testsuite/ChangeLog
	* gcc.target/xstormy16/bswap16.c: New test case.
	* gcc.target/xstormy16/bswap32.c: Likewise.
	* gcc.target/xstormy16/swpb.c: Likewise.
	* gcc.target/xstormy16/swpw-1.c: Likewise.
	* gcc.target/xstormy16/swpw-2.c: Likewise.
nstester pushed a commit to nstester/gcc that referenced this pull request Aug 4, 2023
This patch is the final piece in the series to improve the ABI issues
affecting PR 88873.  The previous patches tackled inserting DFmode
values into V2DFmode registers, by introducing insvti_{low,high}part
patterns.  This patch improves the extraction of DFmode values from
V2DFmode registers via TImode intermediates.

I'd initially thought this would require new extvti_{low,high}part
patterns to be defined, but all that's required is to recognize that
the SUBREG idioms produced by combine are equivalent to (forms of)
vec_select patterns.  The target-independent middle-end can't be sure
that the appropriate vec_select instruction exists on the target,
hence doesn't canonicalize a SUBREG of a vector mode as a vec_select,
but the backend can provide a define_split stating where and when
this is useful, for example, considering whether the operand is in
memory, or whether !TARGET_SSE_MATH and the destination is i387.

For pr88873.c, gcc -O2 -march=cascadelake currently generates:

foo:    vpunpcklqdq     %xmm3, %xmm2, %xmm7
        vpunpcklqdq     %xmm1, %xmm0, %xmm6
        vpunpcklqdq     %xmm5, %xmm4, %xmm2
        vmovdqa %xmm7, -24(%rsp)
        vmovdqa %xmm6, %xmm1
        movq    -16(%rsp), %rax
        vpinsrq $1, %rax, %xmm7, %xmm4
        vmovapd %xmm4, %xmm6
        vfmadd132pd     %xmm1, %xmm2, %xmm6
        vmovapd %xmm6, -24(%rsp)
        vmovsd  -16(%rsp), %xmm1
        vmovsd  -24(%rsp), %xmm0
        ret

with this patch, we now generate:

foo:	vpunpcklqdq     %xmm1, %xmm0, %xmm6
        vpunpcklqdq     %xmm3, %xmm2, %xmm7
        vpunpcklqdq     %xmm5, %xmm4, %xmm2
        vmovdqa %xmm6, %xmm1
        vfmadd132pd     %xmm7, %xmm2, %xmm1
        vmovsd  %xmm1, %xmm1, %xmm0
        vunpckhpd       %xmm1, %xmm1, %xmm1
        ret

The improvement is even more dramatic when compared to the original
29 instructions shown in comment #8.  GCC 13, for example, required
12 transfers to/from memory.

2023-08-04  Roger Sayle  <roger@nextmovesoftware.com>

gcc/ChangeLog
	* config/i386/sse.md (define_split): Convert highpart:DF extract
	from V2DFmode register into a sse2_storehpd instruction.
	(define_split): Likewise, convert lowpart:DF extract from V2DF
	register into a sse2_storelpd instruction.

gcc/testsuite/ChangeLog
	* gcc.target/i386/pr88873.c: Tweak to check for improved code.
nstester pushed a commit to nstester/gcc that referenced this pull request Sep 4, 2023
As discussed in PR104167 (comments #8 and below), and PR111238, using
-Wl,-gc-sections in the libstdc++ testsuite for arm-eabi
(cross-toolchain) avoids link failures for a few tests:

27_io/filesystem/path/108636.cc
std/time/clock/gps/1.cc
std/time/clock/gps/io.cc
std/time/clock/tai/1.cc
std/time/clock/tai/io.cc
std/time/clock/utc/1.cc
std/time/clock/utc/io.cc
std/time/clock/utc/leap_second_info.cc
std/time/exceptions.cc
std/time/format.cc
std/time/time_zone/get_info_local.cc
std/time/time_zone/get_info_sys.cc
std/time/tzdb/1.cc
std/time/tzdb/leap_seconds.cc
std/time/tzdb_list/1.cc
std/time/zoned_time/1.cc
std/time/zoned_time/custom.cc
std/time/zoned_time/io.cc
std/time/zoned_traits.cc

This patch achieves this by calling GLIBCXX_CHECK_LINKER_FEATURES in
cross-build cases, like we already do for native builds. We keep not
doing so in Canadian-cross builds.

However, this would hide the fact that libstdc++ somehow forces the
user to use -Wl,-gc-sections to avoid undefined references to chdir,
mkdir, chmod, pathconf, ... so maybe it's better to keep the status
quo and not apply this patch?

2023-08-31  Christophe Lyon  <christophe.lyon@linaro.org>

libstdc++-v3/ChangeLog:

	PR libstdc++/111238
	* configure: Regenerate.
	* configure.ac: Call GLIBCXX_CHECK_LINKER_FEATURES in cross,
	non-Canadian builds.
XYenChi pushed a commit to XYenChi/gcc that referenced this pull request Mar 6, 2024
hubot pushed a commit that referenced this pull request May 12, 2024
Examining the code generated for the following C snippet on a
raspberry pi:

int popcount_lut8(unsigned *buf, int n)
{
  int cnt=0;
  unsigned int i;
  do {
    i = *buf;
    cnt += lut[i&255];
    cnt += lut[i>>8&255];
    cnt += lut[i>>16&255];
    cnt += lut[i>>24];
    buf++;
  } while(--n);
  return cnt;
}

I was surprised to see the following instruction sequence generated by the
compiler:

  mov    r5, r2, lsr #8
  uxtb   r5, r5

This sequence can be performed by a single ARM instruction:

  uxtb   r5, r2, ror #8

The attached patch allows GCC's combine pass to take advantage of ARM's
uxtb with rotate functionality to implement the above zero_extract, and
likewise to use the sxtb with rotate to implement sign_extract.  ARM's
uxtb and sxtb can only be used with rotates of 0, 8, 16 and 24, and of
these only the 8 and 16 are useful [ror #0 is a nop, and extends with
ror #24 can be implemented using regular shifts], so the approach here
is to add the six missing but useful instructions as six separate
define_insns in arm.md, rather than try to be clever with new predicates.

Later ARM hardware has advanced bit field instructions, and earlier
ARM cores didn't support extend-with-rotate, so this appears to only
benefit armv6 era CPUs (e.g. the raspberry pi).
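
In C terms, four of the six new patterns correspond to extractions like
these (my sketch; the instruction comments assume the armv6 forms
described above):

#include <stdint.h>

uint32_t zext_8_8  (uint32_t x) { return (x >> 8)  & 0xff;   }  /* uxtb rd, rn, ror #8  */
int32_t  sext_8_8  (uint32_t x) { return (int8_t) (x >> 8);  }  /* sxtb rd, rn, ror #8  */
uint32_t zext_8_16 (uint32_t x) { return (x >> 16) & 0xff;   }  /* uxtb rd, rn, ror #16 */
uint32_t zext_16_8 (uint32_t x) { return (x >> 8)  & 0xffff; }  /* uxth rd, rn, ror #8  */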

Patch posted:
https://gcc.gnu.org/legacy-ml/gcc-patches/2018-01/msg01339.html
Approved by Kyrill Tkachov:
https://gcc.gnu.org/legacy-ml/gcc-patches/2018-01/msg01881.html

2024-05-12  Roger Sayle  <roger@nextmovesoftware.com>
	    Kyrill Tkachov  <kyrylo.tkachov@foss.arm.com>

	* config/arm/arm.md (*arm_zeroextractsi2_8_8, *arm_signextractsi2_8_8,
	*arm_zeroextractsi2_8_16, *arm_signextractsi2_8_16,
	*arm_zeroextractsi2_16_8, *arm_signextractsi2_16_8): New.

2024-05-12  Roger Sayle  <roger@nextmovesoftware.com>
	    Kyrill Tkachov  <kyrylo.tkachov@foss.arm.com>

	* gcc.target/arm/extend-ror.c: New test.
fjtrujy added a commit to fjtrujy/gcc that referenced this pull request May 17, 2024
trcrsired pushed a commit to trcrsired/gcc that referenced this pull request May 20, 2024
* * fix undefined references in libgfortran

* * fix internal compiler error in gfortran

* Revert "* fix internal compiler error in gfortran"

This reverts commit 4c81782ca9e75120eb91e1b6b89559f277233621.

* * revert changes for undefined references in gfortran

* Fix build on x86_64-pc-linux-gnu when host is aarch64-w64-mingw32 (#6)

* Fix x86_64-pc-linux-gnu and aarch64-pc-linux-gnu build

---------

Co-authored-by: Evgeny Karpov <eukarpov@gmail.com>
trcrsired pushed a commit to trcrsired/gcc that referenced this pull request May 26, 2024
NinaRanns referenced this pull request in NinaRanns/gcc Jul 3, 2024
hubot pushed a commit that referenced this pull request Sep 7, 2024
…o_debug_section [PR116614]

cat abc.C
  #define A(n) struct T##n {} t##n;
  #define B(n) A(n##0) A(n##1) A(n##2) A(n##3) A(n##4) A(n##5) A(n##6) A(n##7) A(n##8) A(n##9)
  #define C(n) B(n##0) B(n##1) B(n##2) B(n##3) B(n##4) B(n##5) B(n##6) B(n##7) B(n##8) B(n##9)
  #define D(n) C(n##0) C(n##1) C(n##2) C(n##3) C(n##4) C(n##5) C(n##6) C(n##7) C(n##8) C(n##9)
  #define E(n) D(n##0) D(n##1) D(n##2) D(n##3) D(n##4) D(n##5) D(n##6) D(n##7) D(n##8) D(n##9)
  E(1) E(2) E(3)
  int main () { return 0; }
./xg++ -B ./ -o abc{.o,.C} -flto -flto-partition=1to1 -O2 -g -fdebug-types-section -c
./xgcc -B ./ -o abc{,.o} -flto -flto-partition=1to1 -O2
(not included in testsuite as it takes a while to compile) FAILs with
lto-wrapper: fatal error: Too many copied sections: Operation not supported
compilation terminated.
/usr/bin/ld: error: lto-wrapper failed
collect2: error: ld returned 1 exit status

The following patch fixes that.  Most of the 64K+ section support for
reading and writing was already there years ago (and especially the
reading side was already used quite often), and a further bug in it was
fixed by the PR104617 fix.

Yet, the fix isn't solely about removing the
  if (new_i - 1 >= SHN_LORESERVE)
    {
      *err = ENOTSUP;
      return "Too many copied sections";
    }
5 lines, the missing part was that the function only handled reading of
the .symtab_shndx section but not copying/updating of it.
If the result has less than 64K-epsilon sections, that actually wasn't
needed, but e.g. with -fdebug-types-section one can exceed that pretty
easily (reported to us on WebKitGtk build on ppc64le).
Updating the section is slightly more complicated, because it basically
needs to be done in lock step with updating the .symtab section: if one
doesn't need to use SHN_XINDEX for an entry, the section should (or should
be updated to) contain an SHN_UNDEF entry, otherwise it needs to hold
whatever would otherwise be stored there but couldn't fit.  But repeating
all the symtab decisions about what to discard and how to rewrite it just
for that would be ugly.

So, the patch instead emits the .symtab_shndx section (or sections) last,
prepares their content during the .symtab processing, and in a second pass
going just through the .symtab_shndx sections uses the saved content.
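
For illustration, a hedged sketch of that lock-step rule with standard
elf.h types (the helper and its layout are assumptions, not the patch's
code):

#include <elf.h>

/* Emit one symbol's st_shndx together with its .symtab_shndx slot.
   Small (or special) indexes keep st_shndx and store SHN_UNDEF out of
   line; large real indexes store SHN_XINDEX and put the index out of
   line.  */
static void
emit_sym_shndx (Elf64_Sym *sym, Elf32_Word *shndx_entry, unsigned int index)
{
  if (index >= SHN_LORESERVE && index != SHN_ABS && index != SHN_COMMON)
    {
      sym->st_shndx = SHN_XINDEX;
      *shndx_entry = index;
    }
  else
    {
      sym->st_shndx = index;
      *shndx_entry = SHN_UNDEF;   /* keeps the two tables in lock step */
    }
}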

2024-09-07  Jakub Jelinek  <jakub@redhat.com>

	PR lto/116614
	* simple-object-elf.c (SHN_COMMON): Align comment with neighbouring
	comments.
	(SHN_HIRESERVE): Use uppercase hex digits instead of lowercase for
	consistency.
	(simple_object_elf_find_sections): Formatting fixes.
	(simple_object_elf_fetch_attributes): Likewise.
	(simple_object_elf_attributes_merge): Likewise.
	(simple_object_elf_start_write): Likewise.
	(simple_object_elf_write_ehdr): Likewise.
	(simple_object_elf_write_shdr): Likewise.
	(simple_object_elf_write_to_file): Likewise.
	(simple_object_elf_copy_lto_debug_section): Likewise.  Don't fail for
	new_i - 1 >= SHN_LORESERVE, instead arrange in that case to copy
	over .symtab_shndx sections, though emit those last and compute their
	section content when processing associated .symtab sections.  Handle
	simple_object_internal_read failure even in the .symtab_shndx reading
	case.
hubot pushed a commit that referenced this pull request Sep 12, 2024
hubot pushed a commit that referenced this pull request Sep 13, 2024
hubot pushed a commit that referenced this pull request Oct 18, 2024
Implement vddup and vidup using the new MVE builtins framework.

We generate better code because we take advantage of the two outputs
produced by the v[id]dup instructions.

For instance, before:
	ldr	r3, [r0]
	sub	r2, r3, #8
	str	r2, [r0]
	mov	r2, r3
	vddup.u16	q3, r2, #1

now:
	ldr	r2, [r0]
	vddup.u16	q3, r2, #1
	str	r2, [r0]
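
As a usage sketch (my example, not from the patch), the writeback form
is where the second output matters; compiled for an MVE target this is
the kind of code that now gets the shorter sequence:

#include <arm_mve.h>

/* vddup.u16 fills the result with {*base, *base-1, ..., *base-7} and
   writes the decremented start value (*base - 8) back through base.  */
uint16x8_t next_block (uint32_t *base)
{
  return vddupq_wb_u16 (base, 1);
}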

2024-08-21  Christophe Lyon  <christophe.lyon@linaro.org>

	gcc/
	* config/arm/arm-mve-builtins-base.cc (class viddup_impl): New.
	(vddup): New.
	(vidup): New.
	* config/arm/arm-mve-builtins-base.def (vddupq): New.
	(vidupq): New.
	* config/arm/arm-mve-builtins-base.h (vddupq): New.
	(vidupq): New.
	* config/arm/arm_mve.h (vddupq_m): Delete.
	(vddupq_u8): Delete.
	(vddupq_u32): Delete.
	(vddupq_u16): Delete.
	(vidupq_m): Delete.
	(vidupq_u8): Delete.
	(vidupq_u32): Delete.
	(vidupq_u16): Delete.
	(vddupq_x_u8): Delete.
	(vddupq_x_u16): Delete.
	(vddupq_x_u32): Delete.
	(vidupq_x_u8): Delete.
	(vidupq_x_u16): Delete.
	(vidupq_x_u32): Delete.
	(vddupq_m_n_u8): Delete.
	(vddupq_m_n_u32): Delete.
	(vddupq_m_n_u16): Delete.
	(vddupq_m_wb_u8): Delete.
	(vddupq_m_wb_u16): Delete.
	(vddupq_m_wb_u32): Delete.
	(vddupq_n_u8): Delete.
	(vddupq_n_u32): Delete.
	(vddupq_n_u16): Delete.
	(vddupq_wb_u8): Delete.
	(vddupq_wb_u16): Delete.
	(vddupq_wb_u32): Delete.
	(vidupq_m_n_u8): Delete.
	(vidupq_m_n_u32): Delete.
	(vidupq_m_n_u16): Delete.
	(vidupq_m_wb_u8): Delete.
	(vidupq_m_wb_u16): Delete.
	(vidupq_m_wb_u32): Delete.
	(vidupq_n_u8): Delete.
	(vidupq_n_u32): Delete.
	(vidupq_n_u16): Delete.
	(vidupq_wb_u8): Delete.
	(vidupq_wb_u16): Delete.
	(vidupq_wb_u32): Delete.
	(vddupq_x_n_u8): Delete.
	(vddupq_x_n_u16): Delete.
	(vddupq_x_n_u32): Delete.
	(vddupq_x_wb_u8): Delete.
	(vddupq_x_wb_u16): Delete.
	(vddupq_x_wb_u32): Delete.
	(vidupq_x_n_u8): Delete.
	(vidupq_x_n_u16): Delete.
	(vidupq_x_n_u32): Delete.
	(vidupq_x_wb_u8): Delete.
	(vidupq_x_wb_u16): Delete.
	(vidupq_x_wb_u32): Delete.
	(__arm_vddupq_m_n_u8): Delete.
	(__arm_vddupq_m_n_u32): Delete.
	(__arm_vddupq_m_n_u16): Delete.
	(__arm_vddupq_m_wb_u8): Delete.
	(__arm_vddupq_m_wb_u16): Delete.
	(__arm_vddupq_m_wb_u32): Delete.
	(__arm_vddupq_n_u8): Delete.
	(__arm_vddupq_n_u32): Delete.
	(__arm_vddupq_n_u16): Delete.
	(__arm_vidupq_m_n_u8): Delete.
	(__arm_vidupq_m_n_u32): Delete.
	(__arm_vidupq_m_n_u16): Delete.
	(__arm_vidupq_n_u8): Delete.
	(__arm_vidupq_m_wb_u8): Delete.
	(__arm_vidupq_m_wb_u16): Delete.
	(__arm_vidupq_m_wb_u32): Delete.
	(__arm_vidupq_n_u32): Delete.
	(__arm_vidupq_n_u16): Delete.
	(__arm_vidupq_wb_u8): Delete.
	(__arm_vidupq_wb_u16): Delete.
	(__arm_vidupq_wb_u32): Delete.
	(__arm_vddupq_wb_u8): Delete.
	(__arm_vddupq_wb_u16): Delete.
	(__arm_vddupq_wb_u32): Delete.
	(__arm_vddupq_x_n_u8): Delete.
	(__arm_vddupq_x_n_u16): Delete.
	(__arm_vddupq_x_n_u32): Delete.
	(__arm_vddupq_x_wb_u8): Delete.
	(__arm_vddupq_x_wb_u16): Delete.
	(__arm_vddupq_x_wb_u32): Delete.
	(__arm_vidupq_x_n_u8): Delete.
	(__arm_vidupq_x_n_u16): Delete.
	(__arm_vidupq_x_n_u32): Delete.
	(__arm_vidupq_x_wb_u8): Delete.
	(__arm_vidupq_x_wb_u16): Delete.
	(__arm_vidupq_x_wb_u32): Delete.
	(__arm_vddupq_m): Delete.
	(__arm_vddupq_u8): Delete.
	(__arm_vddupq_u32): Delete.
	(__arm_vddupq_u16): Delete.
	(__arm_vidupq_m): Delete.
	(__arm_vidupq_u8): Delete.
	(__arm_vidupq_u32): Delete.
	(__arm_vidupq_u16): Delete.
	(__arm_vddupq_x_u8): Delete.
	(__arm_vddupq_x_u16): Delete.
	(__arm_vddupq_x_u32): Delete.
	(__arm_vidupq_x_u8): Delete.
	(__arm_vidupq_x_u16): Delete.
	(__arm_vidupq_x_u32): Delete.