Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Illegal instruction occurs in simple scenario #1495

Open
derekbruening opened this issue Nov 28, 2014 · 23 comments
Open

Illegal instruction occurs in simple scenario #1495

derekbruening opened this issue Nov 28, 2014 · 23 comments

Comments

@derekbruening
Copy link
Contributor

From lionghos...@gmail.com on August 04, 2014 01:54:31

For the Summary, please follow the guidelines at https://code.google.com/p/dynamorio/wiki/BugReporting and use one of the CRASH, APP CRASH, HANG, or ASSERT keywords What version of DynamoRIO are you using? In trunk r2734 | derek.bruening | 2014-08-04 00:01:48 +0800 (Mon, 04 Aug 2014) What operating system version are you running on? Fedora 20 64 bit What application are you running? memcache v1.4.20
built from official source http://memcached.org/ Is your application 32-bit or 64-bit? 64bit How are you running the application under DynamoRIO? DynamoRIO_build/bin64/drrun -steal_fds 0 -v -c build/libsyminput.so -- memcached_org/install/bin/memcached What happens when you run without any client? It can run successful.

What happens when you run with debug build ("-debug" flag to drrun/drconfig/drinject)? Illegal instruction What steps will reproduce the problem? 1. Download http://memcached.org/ and build it using default configuration
Downlod dynamorio using svn and drmemory stable from http://www.drmemory.org/ 2. Build my simple client in the attachment.
3. DynamoRIO_build/bin64/drrun -steal_fds 0 -v -c build/libsyminput.so -- memcached_org/install/bin/memcached

  1. send data to memcached by
    echo stats |ncat localhost 11211 What is the expected output? What do you see instead? Is this an application crash, a DynamoRIO crash, a DynamoRIO assert, or a hang (see https://code.google.com/p/dynamorio/wiki/BugReporting and set the title appropriately)? It crash with Illegal instruction. Please provide any additional information below.

Attachment: sym.tar

Original issue: http://code.google.com/p/dynamorio/issues/detail?id=1495

@derekbruening
Copy link
Contributor Author

From lionghos...@gmail.com on August 04, 2014 17:02:07

I found the program works today. It deterministically failed yesterday.
I will provide more information when I reproduce it again.

@derekbruening
Copy link
Contributor Author

From lionghos...@gmail.com on August 05, 2014 00:56:37

the call stack trace from core dump is
(gdb) bt
#0 0x00007fc7d6c2420b in __lll_lock_elision () from /lib64/libpthread.so.0
#1 0x000000000040992e in try_read_network (c=0x7fc7c0026200) at ../memcached-1.4.20-src/memcached.c:3862
#2 drive_machine (c=0x7fc7c0026200) at ../memcached-1.4.20-src/memcached.c:4089
#3 event_handler (fd=, which=, arg=0x7fc7c0026200) at ../memcached-1.4.20-src/memcached.c:4353
#4 0x00007fc7d6e40a44 in event_base_loop () from /lib64/libevent-2.0.so.5
#5 0x000000000041095d in worker_libevent (arg=0x1d95490) at ../memcached-1.4.20-src/thread.c:386
#6 0x00007fc7d6c1af33 in start_thread () from /lib64/libpthread.so.0
#7 0x00007fc7d6948ded in clone () from /lib64/libc.so.6

@derekbruening
Copy link
Contributor Author

From lionghos...@gmail.com on August 05, 2014 01:22:02

I manage to reproduce the issue. I provide a link for the test case to reproduce. https://hkustconnect-my.sharepoint.com/personal/yueqili_connect_ust_hk/_layouts/15/guestaccess.aspx?guestaccesstoken=E6lGK8o%2bbL87X4HZ4isgwwfUn7VxdRyN33Id7A7ggfc%3d&docid=02538b063d67e4024911fa7134b6bd7bd In dr_replay, execute run1.sh to start instrumented memcached and run2.sh to send a command to memcached, then the illegal instruction will occurs.
The binaries are compiled for fedora 20 64 bit. If you use different platforms, you need to make the project again.

@derekbruening
Copy link
Contributor Author

From lionghos...@gmail.com on August 05, 2014 01:32:04

The problem can be reproduce using trunk development version and stable version.

@derekbruening
Copy link
Contributor Author

From bruen...@google.com on August 05, 2014 09:02:11

The key question is, what is the instruction inside __lll_lock_elision? Is it xbegin? Xref issue #1314 .

Status: Started
Owner: bruen...@google.com
Labels: -Priority-Medium Priority-High Bug-AppCrash

@derekbruening
Copy link
Contributor Author

From bruen...@google.com on August 05, 2014 16:33:53

I can't reproduce: I rebuilt memcached and libsyminput.so, launched memcached under DR with libsyminput.so, and then sent it data: but no repro, whether using release or debug DR (debug DR does complain about a memory leak on exit when I then ^C memcached -- separate issue).

Does your processor have Intel's TSX?

If you can repro again and get gdb on live or the core dump, can you disassemble the pc so we can have the actual instruction?

Status: NeedInfo

@derekbruening
Copy link
Contributor Author

From lionghos...@gmail.com on August 05, 2014 16:54:17

Does your processor have Intel's TSX?
I don't think my process support TSX. Its core dua, which is years old.

If you can repro again and get gdb on live or the core dump, can you disassemble the pc so we can have the actual instruction?
How to disassemble the pc using gdb?

@derekbruening
Copy link
Contributor Author

From bruen...@google.com on August 05, 2014 16:57:01

How to disassemble the pc using gdb?

x/5i $pc

@derekbruening
Copy link
Contributor Author

From lionghos...@gmail.com on August 05, 2014 17:17:27

Thank you.
I found that I cannot reproduce now.
Somtimes, the problem occurs constantly. I doubt some code block will be only exercise in some conditions like time period.
I will post you the instruction once I reproduce it again.

@derekbruening
Copy link
Contributor Author

From lionghos...@gmail.com on August 05, 2014 18:52:05

I reproduce it now.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `./memcached'.
Program terminated with signal SIGILL, Illegal instruction.
#0 0x00007ffec527320b in __lll_lock_elision () from /lib64/libpthread.so.0
Missing separate debuginfos, use: debuginfo-install glibc-2.18-11.fc20.x86_64 glibc-2.18-12.fc20.x86_64 libevent-2.0.21-3.fc20.x86_64
(gdb) bt
#0 0x00007ffec527320b in __lll_lock_elision () from /lib64/libpthread.so.0
#1 0x000000000040992e in try_read_network (c=0x7ffebc026200) at ../memcached-1.4.20-src/memcached.c:3862
#2 drive_machine (c=0x7ffebc026200) at ../memcached-1.4.20-src/memcached.c:4089
#3 event_handler (fd=, which=, arg=0x7ffebc026200) at ../memcached-1.4.20-src/memcached.c:4353
#4 0x00007ffec548fa44 in event_base_loop () from /lib64/libevent-2.0.so.5
#5 0x000000000041095d in worker_libevent (arg=0x153d490) at ../memcached-1.4.20-src/thread.c:386
#6 0x00007ffec5269f33 in start_thread () from /lib64/libpthread.so.0
#7 0x00007ffec4f97ded in clone () from /lib64/libc.so.6
(gdb) x/5i $pc
=> 0x7ffec527320b <__lll_lock_elision+75>: xbeginq 0x7ffec5273211 <__lll_lock_elision+81>
0x7ffec5273211 <__lll_lock_elision+81>: cmp $0xffffffff,%eax
0x7ffec5273214 <__lll_lock_elision+84>: jne 0x7ffec5273200 <__lll_lock_elision+64>
0x7ffec5273216 <__lll_lock_elision+86>: mov (% r8 ),%eax
0x7ffec5273219 <__lll_lock_elision+89>: test %eax,%eax

@derekbruening
Copy link
Contributor Author

From lionghos...@gmail.com on August 05, 2014 19:18:47

I found the instruction in libpthread.
Is it the lib's bug or the code is invoked by DynamoRIO?
Shall I report bugs to libpthread since fedora is not supposed to use so new isntructions.

@derekbruening
Copy link
Contributor Author

From bruen...@google.com on August 05, 2014 21:01:52

libpthread is supposed to only use xbegin if it determines at init time that the hardware supports lock elision. The question is, did DR somehow mess up its determination on what the hardware supports? That seems unlikely but it's possible. Ideally, if you could reproduce it natively (i.e., not under DR) we would have a definitive answer. If you file the libpthread bug now they will probably blame DR.

Have you ever seen it without a client library?

Qin, any thoughts on whether the private loader might somehow mess up the app libpthread's global vars (I'm looking at __pthread_force_elision)?

lionghostshop, if you still have a core dump or an attached gdb, can you run "p __pthread_force_elision"?

Cc: zhao...@google.com
Labels: OpSys-Linux

@derekbruening
Copy link
Contributor Author

From lionghos...@gmail.com on August 05, 2014 21:14:14

Have you ever seen it without a client library?
I run DR without my client, the problem still occurs.
Note that without DR, memcached work correctly. I think pthread should be very stable in mainstream fedora.
lionghostshop, if you still have a core dump or an attached gdb, can you run "p __pthread_force_elision"?
It prints
$1 = 0

@derekbruening
Copy link
Contributor Author

From lionghos...@gmail.com on August 05, 2014 21:36:26

One thing I found is that the problem seems related to certain boots.
Once I can reproduace after booting my laptop, the problem occurs constantly.
In some boots, I cannot reproduce, it constantly work correctly without problem.

@derekbruening
Copy link
Contributor Author

From lionghos...@gmail.com on August 06, 2014 07:44:56

I found that plugging cable will affect the reproduction of bugs. I use my laptop in two places. Sometimes, I use wireless and sometimes I use wired.
When I use dr to instrument memcached in wifi env, it fails more often.
In wired, it always succeeds so far. In wifi and failure case, plugining network cable will turn it to successful run.

Note that without DR, memcached always works fine.

@derekbruening
Copy link
Contributor Author

From bruen...@google.com on August 06, 2014 07:49:46

I wonder if you could get us access to a core dump? I'd like to see the code cache code for pthread_mutex_lock to investigate a theory that the rip-rel mangling that accesses __pthread_force_elision is failing depending on where libpthread is loaded.

@derekbruening
Copy link
Contributor Author

From lionghos...@gmail.com on August 06, 2014 07:52:49

I include two core dumps and the binary file of memcache.

Attachment: dr_replay.tar.bz2

@derekbruening
Copy link
Contributor Author

From bruen...@google.com on August 06, 2014 07:57:33

Could you also include your libdynamorio.so binary + debuginfo file, as it's a custom build, right?

@derekbruening
Copy link
Contributor Author

From lionghos...@gmail.com on August 06, 2014 07:59:28

The package can be downloaded from comment #3 in this thread.

@derekbruening
Copy link
Contributor Author

From lionghos...@gmail.com on August 06, 2014 17:17:38

Is there a way to set __pthread_force_elision to mask the problem temporarily?

@derekbruening
Copy link
Contributor Author

From bruen...@google.com on August 06, 2014 17:29:12

Is there a way to set __pthread_force_elision to mask the problem temporarily?

I spent a little time on this today and it looks like it's not even checking __pthread_force_elision (or at least not locally). I will have to study the pthreads code again to understand what it's doing in order to posit theories on how things went wrong.

My notes:

% gdb ./memcached core.2328

(gdb) add-symbol-file '/work/dr/bugs/i1495/DR/DynamoRIO_replay_build/lib64/release/libdynamorio.so' 0x71000000

(gdb) disas pthread_mutex_lock
...
0x00007ffec526bff9 <+25>: test %eax,%eax
0x00007ffec526bffb <+27>: jne 0x7ffec526c05d <pthread_mutex_lock+125>
0x00007ffec526bffd <+29>: mov 0x2124e1(%rip),%eax # 0x7ffec547e4e4 <__pthread_force_elision>
0x00007ffec526c003 <+35>: test %eax,%eax
0x00007ffec526c005 <+37>: je 0x7ffec526c030 <pthread_mutex_lock+80>
0x00007ffec526c007 <+39>: test $0x300,%esi
0x00007ffec526c00d <+45>: jne 0x7ffec526c030 <pthread_mutex_lock+80>
0x00007ffec526c00f <+47>: or $0x100,%esi
0x00007ffec526c015 <+53>: mov %esi,0x10(%rdi)
0x00007ffec526c018 <+56>: mov %esi,%edx
0x00007ffec526c01a <+58>: lea 0x16(% r8 ),%rsi
0x00007ffec526c01e <+62>: mov % r8 ,%rdi
0x00007ffec526c021 <+65>: and $0x80,%edx
0x00007ffec526c027 <+71>: jmpq 0x7ffec52731c0 <__lll_lock_elision>
0x00007ffec526c02c <+76>: nopl 0x0(%rax)
0x00007ffec526c030 <+80>: and $0x80,%esi
0x00007ffec526c03b <+91>: xor %eax,%eax
0x00007ffec526c03d <+93>: lock cmpxchg %edi,(% r8 )
0x00007ffec526c042 <+98>: jne 0x7ffec526c16a <_L_lock_814>
0x00007ffec526c048 <+104>: mov %fs:0x2d0,%eax
0x00007ffec526c050 <+112>: mov %eax,0x8(% r8 )
0x00007ffec526c054 <+116>: addl $0x1,0xc(% r8 )
0x00007ffec526c059 <+121>: nop
0x00007ffec526c05a <+122>: xor %eax,%eax
0x00007ffec526c05c <+124>: retq
0x00007ffec526c05d <+125>: cmp $0x100,%eax
0x00007ffec526c062 <+130>: je 0x7ffec526c018 <pthread_mutex_lock+56>

(gdb) p *shared_bb->table[(((0x00007ffec526bffd * 11400714819323198485) >> (64 - shared_bb->hash_bits)) & shared_bb->hash_mask)+0]
$8 = {
tag = 0x7ffec526bffd <pthread_mutex_lock+29> "\213\005\341$!",
flags = 150996528,
size = 25,
prefix_size = 0 '\000',
fcache_extra = 11 '\v',
start_pc = 0x50e39c10 "H\270\344\344G\305\376\177",
in_xlate = {
incoming_stubs = 0x50e6bb80,
translation_info = 0x50e6bb80
},
next_vmarea = 0x50f0b528,
prev_vmarea = 0x50f114e8,
also = {
also_vmarea = 0x0,
flushtime = 0
}
}
(gdb) x/15i 0x50e39c10
0x50e39c10: movabs $0x7ffec547e4e4,%rax
0x50e39c1a: mov (%rax),%eax
0x50e39c1c: test %eax,%eax
0x50e39c1e: je 0x50e39c34
0x50e39c24: jmpq 0x50e54fb5
(gdb) x/4dx 0x7ffec547e4e4
0x7ffec547e4e4 <__pthread_force_elision>: 0x00000000 0x00000000 0x00000000 0x00000000
(gdb) x/4i 0x50e54fb5
0x50e54fb5: addr32 mov %rax,%gs:0x0
0x50e54fbd: movabs $0x50f11490,%rax
0x50e54fc7: jmpq 0x50e2a300
(gdb) p *shared_bb->table[(((0x00007ffec526c007 * 11400714819323198485) >> (64 - shared_bb->hash_bits)) & shared_bb->hash_mask)+0]
$9 = {
tag = 0x0,
(gdb) p *shared_bb->table[(((0x00007ffec526c00f * 11400714819323198485) >> (64 - shared_bb->hash_bits)) & shared_bb->hash_mask)+0]
$10 = {
tag = 0x0,

So it did not go to 0x00007ffec526c007 or 0x00007ffec526c00f.
Yet it did reach __lll_lock_elision (which of course the SIGILL implies):

(gdb) disas __lll_lock_elision
Dump of assembler code for function __lll_lock_elision:
0x00007ffec52731c0 <+0>: movzwl (%rsi),%eax
0x00007ffec52731c3 <+3>: mov %rdi,% r8 (gdb) p _shared_bb->table[(((0x00007ffec52731c0 * 11400714819323198485) >> (64 - shared_bb->hash_bits)) & shared_bb->hash_mask)+0]$11 = {
tag = 0x7ffec52731c0 <__lll_lock_elision> "\017\267\006I\211\370f\205\300~\035\203\350\001f\211\006\277\001",
flags = 150996528,
size = 20,
prefix_size = 0 '\000',
fcache_extra = 8 '\b',
start_pc = 0x510605f0 "\017\267\006I\211\370f\205\300\017\216\r",
in_xlate = {
incoming_stubs = 0x51092eb8,
translation_info = 0x51092eb8
},
(gdb) p *(direct_linkstub_t *)0x51092eb8
$14 = {
cdl = {
l = {
flags = 55297,
cti_offset = 15
},
next_incoming = 0x0
},
target_tag = 0x7ffec52731c0 <_lll_lock_elision> "\017\267\006I\211\370f\205\300~\035\203\350\001f\211\006\277\001",
stub_pc = 0x50f781ee "geH\243"
}
(gdb) x/4i 0x50f781ee
0x50f781ee: addr32 mov %rax,%gs:0x0
0x50f781f6: movabs $0x51092eb8,%rax
0x50f78200: jmpq 0x50e2a300
(gdb) p *(fragment_t
)(0x51092eb8-sizeof(fragment_t))
$15 = {
tag = 0x7ffec526c018 <pthread_mutex_lock+56> "\211\362I\215p\026L\211", <incomplete sequence \342\200>,

So it came through that later entry. Hmm, so it's not checking
__pthread_force_elision? At least not here.

@derekbruening
Copy link
Contributor Author

From lionghos...@gmail.com on August 06, 2014 17:34:10

Will it help if you also check the source code of pthread?

@zhaoqin
Copy link
Contributor

zhaoqin commented Sep 23, 2015

0x00007ffec526bff9 <+25>: test %eax,%eax
0x00007ffec526bffb <+27>: jne 0x7ffec526c05d <pthread_mutex_lock+125>
0x00007ffec526bffd <+29>: mov 0x2124e1(%rip),%eax # 0x7ffec547e4e4 <__pthread_force_elision>
0x00007ffec526c003 <+35>: test %eax,%eax
0x00007ffec526c005 <+37>: je 0x7ffec526c030 <pthread_mutex_lock+80>
0x00007ffec526c007 <+39>: test $0x300,%esi
0x00007ffec526c00d <+45>: jne 0x7ffec526c030 <pthread_mutex_lock+80>
...
0x00007ffec526c05d <+125>: cmp $0x100,%eax
0x00007ffec526c062 <+130>: je 0x7ffec526c018 <pthread_mutex_lock+56>

corresponding to pthread_mutex_lock code in nptl/pthread_mutex_lock.c

  if (__glibc_likely (type == PTHREAD_MUTEX_TIMED_NP))
    {
      FORCE_ELISION (mutex, goto elision);
      ...
    }
#ifdef HAVE_ELISION
  else if (__glibc_likely (type == PTHREAD_MUTEX_TIMED_ELISION_NP))
    {
  elision: __attribute__((unused))
      /* This case can never happen on a system without elision,
         as the mutex type initialization functions will not
         allow to set the elision flags.  */
      /* Don't record owner or users for elision case.  This is a
         tail call.  */
      return LLL_MUTEX_LOCK_ELISION (mutex);
    }
#endif

where FORCE_ELISION (mutex, goto elision) is

/* Automatically enable elision for existing user lock kinds.  */
#define FORCE_ELISION(m, s)                                             \
  if (__pthread_force_elision                                           \
      && (m->__data.__kind & PTHREAD_MUTEX_ELISION_FLAGS_NP) == 0)      \
    {                                                                   \
      mutex->__data.__kind |= PTHREAD_MUTEX_ELISION_NP;                 \
      s;                                                                \
    }

So there are two possible way to get to the lll_lock_elision:

  1. type is PTHREAD_MUTEX_TIMED_NP (=0), and __pthread_forc_elision and some flags are set
  2. type is PTHREAD_MUTEX_TIMED_ELISION_NP (=PTHREAD_MUTEX_TIMED_NP | PTHREAD_MUTEX_ELISION_NP = 0x100)

So it means somehow, the type is set as 0x100

unsigned int type = PTHREAD_MUTEX_TYPE_ELISION (mutex);
/* Don't include NO_ELISION, as that type is always the same
   as the underlying lock type.  */
#define PTHREAD_MUTEX_TYPE_ELISION(m) \
  ((m)->__data.__kind & (127|PTHREAD_MUTEX_ELISION_NP))

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants