Qualcomm MSM8974PRO-AC ARM: gossipd/test/run-bench-find_route broken by a2fa699 #2818

Open
jsarenik opened this issue Jul 19, 2019 · 25 comments

@jsarenik
Collaborator

Issue and Steps to Reproduce

On armv7l with up-to-date Ubuntu 19.04, I get the following error (with both DEVELOPER=1 and DEVELOPER=0) on every version from a2fa699 up to current master:

# gossipd/test/run-bench-find_route
gossip_store_compact_offline: 0 deleted, 0 copied
Creating nodes...
Populating nodes...
Bus error
# echo $?
135

git bisect led me to this:

a2fa699e0ea00d77d755248685bc7cfeec522e2f is the first bad commit
commit a2fa699e0ea00d77d755248685bc7cfeec522e2f
Author: Rusty Russell <rusty@rustcorp.com.au>
Date:   Mon Apr 8 19:28:32 2019 +0930

First I created a bisect script, then established that this issue is not present in v0.7.0 but is present in v0.7.1. Here is how I ran the bisect:

cat > ~/bisect-gossipd-test-run-bench-find_route.sh <<EOF
#!/bin/sh

{
git clean -xfd
git submodule deinit --all -f
export DEVELOPER=0
./configure || true
make -j4 gossipd/test/run-bench-find_route \
  && gossipd/test/run-bench-find_route
} && echo Success || { echo FAIL; exit 1; }
EOF
chmod a+x ~/bisect*.sh
git bisect start v0.7.1 v0.7.0 --
git bisect run ~/bisect-gossipd-test-run-bench-find_route.sh
git bisect reset

@rustyrussell please have a look, as your commit seems to have caused the test failure.
@NicolasDorier have you noticed this on any other ARM device? I can try it with Alpine on the same ARM device (in a chroot).

Have a peaceful weekend!

@NicolasDorier
Collaborator

NicolasDorier commented Jul 19, 2019

@jsarenik I have only built on arm32, but never tried running it myself. BTCPayServer does not support clightning on arm32 yet, because we need lightning charge and lightning spark to support it as well (this will be the case in the next release).

@jsarenik
Collaborator Author

This test does not fail on aarch64 (ARM64) Alpine Linux (musl libc).

$ gossipd/test/run-bench-find_route
gossip_store_compact_offline: 0 deleted, 0 copied
Creating nodes...
Populating nodes...
Starting...
1 (1 succeeded) routes in 100 nodes in 1 msec (1576306 nanoseconds per route)
 Length 5: 1
$ echo $?
0
$ uname -a
Linux linaro-developer 4.14.0-qcomlt-arm64 #1 SMP PREEMPT Wed Jan 30 04:14:16 UTC 2019 aarch64 Linux

@ZmnSCPxj
Collaborator

The commit itself is large, so it is hard to determine which part introduced the issue. Is it possible to run it in gdb and get a backtrace?

@cdecker cdecker added the arm label Jul 24, 2019
@jsarenik
Collaborator Author

Sure, will do.

@jsarenik
Collaborator Author

jsarenik commented Jul 30, 2019

First things first: I was also able to reproduce the issue on 32-bit ARM running musl libc. I did the gdb debugging on this Alpine Linux system as well, because it does not have the debugging-symbol problem that Ubuntu has. (Ubuntu hardwires /lib/ld-linux-armhf.so.3 into binaries at compile time, but this file is a symlink to arm-linux-gnueabihf/ld-2.29.so, and the debugging symbols are in /usr/lib/debug/lib/arm-linux-gnueabihf/ld-2.29.so, which gdb does not find, even after I tried a few tricks.)

So, here we go:

localhost:~/lightning-auto-test/lightning# uname -a
Linux localhost 3.4.0-lineageos-gb263a89 #1 SMP PREEMPT Wed Oct 24 09:09:32 UTC 2018 armv7l Linux
localhost:~/lightning-auto-test/lightning# git rev-parse --short HEAD          
0ae20399
localhost:~/lightning-auto-test/lightning# ldd gossipd/test/run-bench-find_route
        /lib/ld-musl-armhf.so.1 (0xb6f46000)
        libc.musl-armhf.so.1 => /lib/ld-musl-armhf.so.1 (0xb6f46000)
localhost:~/lightning-auto-test/lightning# gossipd/test/run-bench-find_route
gossip_store_compact_offline: 0 deleted, 0 copied
Creating nodes...
Populating nodes...
Bus error
localhost:~/lightning-auto-test/lightning# echo $?
135
localhost:~/lightning-auto-test/lightning# gdb gossipd/test/run-bench-find_route
GNU gdb (GDB) 8.3
Copyright (C) 2019 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "armv6-alpine-linux-musleabihf".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from gossipd/test/run-bench-find_route...
(gdb) run
Starting program: /root/lightning-auto-test/lightning/gossipd/test/run-bench-find_route 
gossip_store_compact_offline: 0 deleted, 0 copied
Creating nodes...
Populating nodes...

Program received signal SIGBUS, Bus error.
0x2a01539c in add (ctx=0xbefffaa0, p=0x2a0c5045, len=25)
    at ccan/ccan/crypto/siphash24/siphash24.c:86
86                              add_64bits(ctx->v, *(const uint64_t *)data);
(gdb) bt                                                                        
#0  0x2a01539c in add (ctx=0xbefffaa0, p=0x2a0c5045, len=25)                    
    at ccan/ccan/crypto/siphash24/siphash24.c:86                               
#1  0x2a015610 in siphash24_update (ctx=0xbefffaa0, p=0x2a0c5045, size=33)     
    at ccan/ccan/crypto/siphash24/siphash24.c:116                              
#2  0x2a015ee8 in siphash24 (seed=0x2a0be228 <siphashseed>, p=0x2a0c5045,      
    size=33) at ccan/ccan/crypto/siphash24/siphash24.c:169                     
#3  0x2a032eac in node_map_hash_key (pc=0x2a0c5045)                            
    at gossipd/test/../routing.c:214
#4  0x2a031f8c in node_map_get (ht=0x2a0c4904, k=0x2a0c5045)                   
    at gossipd/test/../routing.h:130
#5  0x2a032fa8 in get_node (rstate=0x2a0c4804, id=0x2a0c5045)                  
    at gossipd/test/../routing.c:241
#6  0x2a0336d4 in new_chan (rstate=0x2a0c4804, scid=0xbefffb80,                
    id1=0x2a0c5045, id2=0x2a0c5024, satoshis=...)                              
    at gossipd/test/../routing.c:413
#7  0x2a03bfac in add_connection (rstate=0x2a0c4804, nodes=0x2a0c5024, from=1, 
    to=0, base_fee=436, proportional_fee=944, delay=113)
    at gossipd/test/run-bench-find_route.c:119                                 
#8  0x2a03c228 in populate_random_node (rstate=0x2a0c4804, nodes=0x2a0c5024,   
    n=1) at gossipd/test/run-bench-find_route.c:158
#9  0x2a03c638 in main (argc=1, argv=0xbefffd54)                               
    at gossipd/test/run-bench-find_route.c:226
(gdb) c
Continuing.

Program terminated with signal SIGBUS, Bus error.
The program no longer exists.
(gdb) q
localhost:~/lightning-auto-test/lightning# 

All this is on current master (0ae2039).

A more thorough debug log is in the attachment. The file was created by running gdb -batch -n -ex 'set pagination off' -ex 'set logging on' -ex 'echo >> Running the program...\n' -ex run -ex 'echo >> bt\n' -ex bt -ex 'echo >> bt full\n' -ex 'bt full' -ex 'echo >> thread apply all bt full\n' -ex 'thread apply all bt full' -ex 'echo >> c' -ex c --args gossipd/test/run-bench-find_route

gdb.txt

@jsarenik
Collaborator Author

Could it simply be caused by the unusual setup I use (i.e. running chroots on top of Android)?

@ZmnSCPxj
Collaborator

ZmnSCPxj commented Jul 30, 2019

Can you do disp data at the crash point? It might be a "bus error" due to an alignment problem: the device you are running on might not be able to access a u64 at an address that is not a multiple of 4 or 8. The value p=0x2a0c5045 ends in 5, so the input address is not aligned, and this may be a misaligned access that the CPU does not support.

https://en.wikipedia.org/wiki/Bus_error#Unaligned_access

Do you know the exact chipset you are running on?

@jsarenik
Collaborator Author

As for the chipset, I hope this helps. If not, please tell me what to run.

# cat /proc/cpuinfo 
Processor	: ARMv7 Processor rev 1 (v7l)
processor	: 0
BogoMIPS	: 38.40

processor	: 1
BogoMIPS	: 38.40

processor	: 2
BogoMIPS	: 38.40

processor	: 3
BogoMIPS	: 38.40

Features	: swp half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt 
CPU implementer	: 0x51
CPU architecture: 7
CPU variant	: 0x2
CPU part	: 0x06f
CPU revision	: 1

Hardware	: Qualcomm MSM8974PRO-AC
Revision	: 0000
Serial		: 0000000000000000

@jsarenik
Collaborator Author

Some more hardware hints from the host system shell:

cancro:/ # cat /system/build.prop | grep -i MI                                 
ro.product.model=MI Cancro
ro.product.brand=Xiaomi
ro.product.manufacturer=Xiaomi
ro.build.fingerprint=Xiaomi/lineage_cancro/cancro:7.1.2/NJH47F/7c83ed9cdf:userdebug/release-keys
# from device/xiaomi/cancro/system.prop
rild.libpath=/vendor/lib/libril-qc-qmi-1.so
mm.enable.smoothstreaming=true
ro.fm.transmitter=false
persist.data.qmi.adb_logmask=0
persist.demo.hdmirotationlock=false
ro.hdmi.enable=true
ro.com.google.clientidbase=android-xiaomi
dalvik.vm.heapgrowthlimit=192m
dalvik.vm.heapminfree=2m
ro.bootimage.build.fingerprint=Xiaomi/lineage_cancro/cancro:7.1.2/NJH47F/7c83ed9cdf:userdebug/release-keys

@jsarenik
Collaborator Author

@ZmnSCPxj disp data:

In interactive gdb session:

# gdb gossipd/test/run-bench-find_route
GNU gdb (GDB) 8.3
Copyright (C) 2019 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "armv6-alpine-linux-musleabihf".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from gossipd/test/run-bench-find_route...
(gdb) run
Starting program: /root/lightning-auto-test/lightning/gossipd/test/run-bench-find_route 
gossip_store_compact_offline: 0 deleted, 0 copied
Creating nodes...
Populating nodes...

Program received signal SIGBUS, Bus error.
0x2a01539c in add (ctx=0xbefffab0, p=0x2a0c5045, len=25)
    at ccan/ccan/crypto/siphash24/siphash24.c:86
86				add_64bits(ctx->v, *(const uint64_t *)data);
(gdb) disp data
1: data = (const unsigned char *) 0x2a0c504d "1\362$\036\335|\035֏0 \263\004\030\064\205\341\351\374\070}K\261\224K\003\017\rcIt|Ŷ\005\230k\342\237\371\206\375\342\364o\030<ۭ\023\222\313\036\251\350\377t\a\003\001\263\263y\220\060,D|r\003\323\342db(\374\201\255\341\366\233\020\201<\223-\305\357\202\061n\003\322\320%\200\200\340K\234\257V\227\371܄\034\364\330J\370\n\303\345\267-\365\363h\210\311Ti/\003\315\350m\363\370\270ʬ\340VC\333K\307\001\177\321\363/;\002\243uE\206\067֛\232\024[\244\003\360\001\232\243+r\241\341≢)\321$^]$ \363\214ۿu\366\224\353\201\376\260\372\264\260\002\347\374\330\b\312\071V", <incomplete sequence \334>...
(gdb) c
Continuing.

Program terminated with signal SIGBUS, Bus error.
The program no longer exists.
(gdb) q
# 

@jsarenik
Collaborator Author

OK, it might be the chip. I have verified that it works well on i.MX6 (which is also 32-bit):

me@mail:~/lightning-auto-test/lightning$ git rev-parse --short HEAD
0ae20399
me@mail:~/lightning-auto-test/lightning$ gossipd/test/run-bench-find_route
gossip_store_compact_offline: 0 deleted, 0 copied
Creating nodes...
Populating nodes...
Starting...
1 (1 succeeded) routes in 100 nodes in 3 msec (3197234 nanoseconds per route)
 Length 8: 1
me@mail:~/lightning-auto-test/lightning$ cat /proc/cpuinfo 
processor	: 0
model name	: ARMv7 Processor rev 10 (v7l)
BogoMIPS	: 3.00
Features	: half thumb fastmult vfp edsp neon vfpv3 tls vfpd32 
CPU implementer	: 0x41
CPU architecture: 7
CPU variant	: 0x2
CPU part	: 0xc09
CPU revision	: 10

Hardware	: Freescale i.MX6 Quad/DualLite (Device Tree)
Revision	: 61013
Serial		: 0000000000000000
me@mail:~/lightning-auto-test/lightning$ uname -a
Linux mail 4.9.150-imx6-sr #1 SMP Sun Jun 9 06:05:39 UTC 2019 armv7l GNU/Linux

@jsarenik jsarenik changed the title ARM: gossipd/test/run-bench-find_route broken by a2fa699 Qualcomm MSM8974PRO-AC ARM: gossipd/test/run-bench-find_route broken by a2fa699 Jul 30, 2019
@jsarenik
Collaborator Author

Closing it for now. If someone else faces the same issue, they can add comments, but at this point I do not think this is a general issue.

@jsarenik
Collaborator Author

@ZmnSCPxj maybe add a label like wontfix or hw-issue?

@ZmnSCPxj
Collaborator

Yes, but ccan "should" work even on CPUs that bus error on unaligned access; that is the intent of ccan. What do you think, @rustyrussell? Or should we move this to https://github.com/rustyrussell/ccan/ ?

@jsarenik
Collaborator Author

@ZmnSCPxj any idea how I can reproduce this directly in ccan?

@jsarenik
Collaborator Author

I will try to run make check on ccan...

@ZmnSCPxj
Collaborator

ZmnSCPxj commented Jul 31, 2019

Not sure. You might need a bespoke test in ccan that performs siphash on an array of char, with the important tweak that you deliberately pass in a misaligned pointer, e.g. you have:

char buffer[1000];

(void) siphash24(&buffer[1], sizeof(buffer) - 1);

Or maybe malloc it, since the compiler might place a char array at an unaligned address, in which case &buffer[1] might accidentally end up aligned. You would have to probe with gdb, set a breakpoint on the siphash24 function, and look at the actual pointer. However, malloc is guaranteed to return aligned addresses, so offsetting a pointer returned by malloc reliably gives you a misaligned pointer.

@jsarenik
Collaborator Author

> Yes, but ccan "should" work even on CPUs that bus error on unaligned access, that is intent of ccan. What do you think @rustyrussell ? Or move this to https://github.com/rustyrussell/ccan/ ?

I have made rustyrussell/ccan#84

@jsarenik
Collaborator Author

jsarenik commented Aug 4, 2019

> Not sure. You might need a boutique test on ccan that specifically performs siphash on an array of char, with the important tweak that you specifically pass in a misaligned pointer e.g. you have:
>
> char buffer[1000];
>
> (void) siphash24(&buffer[1], sizeof(buffer) - 1);
>
> Or maybe malloc it, since char might be allocated by the compiler on unaligned address and the &buffer[1] might accidentally realign. You would have to probe by gdb and breakpoint to the siphash24 function and see the actual pointer. However malloc is assured to return aligned addresses, so specifically misaligning a pointer returned by malloc reliably gives you a misaligned pointer.

Hi @ZmnSCPxj! Please have a look at https://github.com/jsarenik/siphash24-repro and tell me whether it makes sense. Once compiled, it currently ends with a Segmentation fault on the CPU that has the alignment issue, but finishes successfully on i.MX6. In the meantime I spoke to someone else who noticed this alignment issue on Qualcomm chips years ago, and he says it has something to do with the fact that it is a Krait core.

I think the following issue may also be related: tensorflow/tensorflow#19158

@jsarenik
Collaborator Author

For reference: https://www.kernel.org/doc/Documentation/unaligned-memory-access.txt

@NicolasDorier
Collaborator

This should be reopened until the ccan lib has merged your fix and clightning has been updated.

@jsarenik
Collaborator Author

OK. Reopening. Thanks for the feedback, @NicolasDorier!

@jsarenik jsarenik reopened this Aug 16, 2019
@jsarenik
Collaborator Author

Just a ping. The bug is still present in current master (ede5f5b).

@jsarenik
Collaborator Author

@ZmnSCPxj please have a look at whether the code in https://github.com/jsarenik/siphash24-repro makes sense.

@jsarenik
Collaborator Author

jsarenik commented Dec 1, 2022

Just an update. I do not have this hardware anymore. It died at the beginning of this year. But I am not closing the issue (I tried that in the past :)

#2818 (comment)
