
intermittent SMP crashes on x86_64 #21317

Closed
andrewboie opened this issue Dec 11, 2019 · 15 comments
Assignees: andrewboie
Labels: area: SMP (Symmetric multiprocessing), area: X86_64 (x86-64 Architecture, 64-bit), bug, priority: medium

Comments

@andrewboie
Contributor

andrewboie commented Dec 11, 2019

I'm seeing some sporadic crashes on x86_64.

These crashes seem to have the following characteristics:

  1. Instruction pointer (RIP) is NULL
  2. It seems to happen when main is creating new child threads to run test cases, but I haven't been able to pinpoint where or get a stack trace

Here's an example, but I have seen this occur in a lot of tests:

*** Booting Zephyr OS build zephyr-v2.1.0-238-g5abb770487f7  ***
Running test suite test_sprintf
===================================================================
starting test - test_sprintf_double
SKIP - test_sprintf_double
===================================================================
starting test - test_sprintf_integer
E: ***** CPU Page Fault (error code 0x0000000000000010)
E: Supervisor thread executed address 0x0000000000000000
E: PML4E: 0x000000000011a827 Writable, User, Execute Enabled
E: PDPTE: 0x0000000000119827 Writable, User, Execute Enabled
E:   PDE: 0x0000000000118827 Writable, User, Execute Enabled
E:   PTE: Non-present
E: RAX: 0x0000000000000008 RBX: 0x0000000000000000 RCX: 0x00000000000f4240 RDX: 0x0000000000000000
E: RSI: 0x0000000000127000 RDI: 0x0000000000002710 RBP: 0x0000000000000000 RSP: 0x0000000000126fb0
E:  R8: 0x000000000011cd0c  R9: 0x0000000000000000 R10: 0x0000000000000000 R11: 0x0000000000000000
E: R12: 0x0000000001000000 R13: 0x0000000000000000 R14: 0x0000000000000000 R15: 0x0000000000000000
E: RSP: 0x0000000000126fb0 RFLAGS: 0x0000000000000202 CS: 0x0018 CR3: 0x000000000010a000
E: call trace:
E: RIP: 0x0000000000000000
E: NULL base ptr
E: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 1
E: Current thread: 0x000000000011c8a0 (main)
E: Halting system

I started noticing this after I enabled boot page tables. It's unclear whether my work introduced this or it was an issue that was already present, although I'm starting to suspect the latter, since the code I brought in works great for 32-bit.

Due to sanitycheck's automatic retries of failed test cases (see #14173), this has gone undetected in CI.

@andrewboie andrewboie added the bug The issue is a bug, or the PR is fixing a bug label Dec 11, 2019
@andrewboie andrewboie self-assigned this Dec 11, 2019
@andrewboie andrewboie added area: X86 x86 Architecture (32-bit) area: X86_64 x86-64 Architecture (64-bit) and removed area: X86 x86 Architecture (32-bit) labels Dec 11, 2019
@jhedberg jhedberg added the priority: medium Medium impact/importance bug label Dec 17, 2019
@andrewboie
Contributor Author

andrewboie commented Dec 18, 2019

At this time I don't believe any of the changes I have made to page tables are the culprit.

I did a git checkout of 1228798, which is the revision that introduced SMP into the qemu_x86_long target. I then did multiple runs of "sanitycheck -n -p qemu_x86_long; stty sane". I am seeing tests randomly fail there as well, although the errors look different, since at that revision the boot page tables have every page accessible and present.

Not only that, but I have found a way to make the problem reproduce much more frequently:

diff --git a/subsys/testsuite/ztest/src/ztest.c b/subsys/testsuite/ztest/src/ztest.c
index 51ddd55bdf..1b016f4a37 100644
--- a/subsys/testsuite/ztest/src/ztest.c
+++ b/subsys/testsuite/ztest/src/ztest.c
@@ -312,6 +312,7 @@ static int run_test(struct unit_test *test)
                Z_TC_END_RESULT(ret, test->name);
        }
 
+       k_sleep(200);
        return ret;
 }

I don't understand what this means yet, but instead of 0-2 failures for any given sanitycheck run, I get more like 8-10. There's no reason why this change should break anything, and indeed it doesn't on our other uniprocessor targets.

I then tested, at the same revision 1228798, whether I can get failures with the older x86_64 port written by @andyross , which has since been removed from Zephyr. Unfortunately, I can. Multiple runs of "sanitycheck -n -p qemu_x86_64; stty sane" also produce about 10 or so errors each run. It's not the same set of tests each time, although some tests tend to fail more often than others.

I then learned that our ARC and Xtensa SMP implementations are not being tested in CI. I filed #21469 for ARC. For Xtensa I don't think we have an emulator which will let us run SMP in CI.

Because I can reproduce similar failures in both the old and new x86_64 ports, I am starting to think this is an issue in the core kernel.

@andrewboie andrewboie added the area: SMP Symmetric multiprocessing label Dec 18, 2019
@andrewboie
Contributor Author

andrewboie commented Dec 18, 2019

I've prepared a branch at the revision where SMP was introduced for qemu_x86_long, plus my ztest change that adds a sleep after running each test.

https://github.com/andrewboie/zephyr/tree/smp-issues

To reproduce what I am seeing, check this out, update west, and then run sanitycheck for either qemu_x86_long (new port) or qemu_x86_64 (old port); both should exhibit random failures.
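Concretely, the setup looks something like this (the remote name here is arbitrary; the sanitycheck invocation is also visible in the output below):

git remote add andrewboie https://github.com/andrewboie/zephyr
git fetch andrewboie smp-issues
git checkout -B smp-issues andrewboie/smp-issues
west update

# either port; both should show random failures
sanitycheck -n -p qemu_x86_long    # new port
sanitycheck -n -p qemu_x86_64      # old port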

Example output in this branch:

apboie@shodan:~/projects/zephyr2/zephyr (1) smp-issues /home/apboie/projects/zephyr2/zephyr
$ sanitycheck -p qemu_x86_64 -p qemu_x86_long -n
JOBS: 32
Building testcase defconfigs...
Filtering test cases...
252 tests selected, 140387 tests discarded due to filters
total complete:   42/ 249  16%  failed:    0

qemu_x86_64               tests/kernel/mem_pool/sys_mem_pool/kernel.memory_pool FAILED: failed
	see: sanity-out/qemu_x86_64/tests/kernel/mem_pool/sys_mem_pool/kernel.memory_pool/handler.log

total complete:   78/ 249  31%  failed:    1

qemu_x86_64               tests/lib/ringbuffer/libraries.data_structures     FAILED: unexpected byte
	see: sanity-out/qemu_x86_64/tests/lib/ringbuffer/libraries.data_structures/handler.log

total complete:   86/ 249  34%  failed:    2

qemu_x86_64               tests/posix/fs/portability.posix.newlib            FAILED: unexpected byte
	see: sanity-out/qemu_x86_64/tests/posix/fs/portability.posix.newlib/handler.log

total complete:  153/ 249  61%  failed:    3

qemu_x86_64               tests/kernel/common/kernel.common.misra            FAILED: timeout
	see: sanity-out/qemu_x86_64/tests/kernel/common/kernel.common.misra/handler.log

total complete:  154/ 249  61%  failed:    4

qemu_x86_64               tests/kernel/common/kernel.common                  FAILED: timeout
	see: sanity-out/qemu_x86_64/tests/kernel/common/kernel.common/handler.log

total complete:  170/ 249  68%  failed:    5

qemu_x86_64               tests/kernel/mbox/mbox_usage/kernel.mailbox        FAILED: timeout
	see: sanity-out/qemu_x86_64/tests/kernel/mbox/mbox_usage/kernel.mailbox/handler.log

total complete:  171/ 249  68%  failed:    6

qemu_x86_64               tests/kernel/mem_pool/mem_pool_api/kernel.memory_pool FAILED: timeout
	see: sanity-out/qemu_x86_64/tests/kernel/mem_pool/mem_pool_api/kernel.memory_pool/handler.log

total complete:  172/ 249  69%  failed:    7

qemu_x86_64               tests/kernel/mem_protect/sys_sem/kernel.memory_protection.sys_sem.nouser FAILED: timeout
	see: sanity-out/qemu_x86_64/tests/kernel/mem_protect/sys_sem/kernel.memory_protection.sys_sem.nouser/handler.log

total complete:  183/ 249  73%  failed:    8

qemu_x86_64               tests/kernel/queue/kernel.queue                    FAILED: timeout
	see: sanity-out/qemu_x86_64/tests/kernel/queue/kernel.queue/handler.log

total complete:  192/ 249  77%  failed:    9

qemu_x86_64               tests/kernel/smp/kernel.multiprocessing            FAILED: timeout
	see: sanity-out/qemu_x86_64/tests/kernel/smp/kernel.multiprocessing/handler.log

total complete:  206/ 249  82%  failed:   10

qemu_x86_64               tests/lib/json/libraries.encoding                  FAILED: timeout
	see: sanity-out/qemu_x86_64/tests/lib/json/libraries.encoding/handler.log

total complete:  214/ 249  85%  failed:   11

qemu_x86_64               tests/posix/fs/portability.posix                   FAILED: timeout
	see: sanity-out/qemu_x86_64/tests/posix/fs/portability.posix/handler.log

total complete:  217/ 249  87%  failed:   12

qemu_x86_64               tests/subsys/logging/log_core/logging.log_core     FAILED: timeout
	see: sanity-out/qemu_x86_64/tests/subsys/logging/log_core/logging.log_core/handler.log

total complete:  234/ 249  93%  failed:   13

qemu_x86_long             tests/kernel/common/kernel.common.misra            FAILED: timeout
	see: sanity-out/qemu_x86_long/tests/kernel/common/kernel.common.misra/handler.log

total complete:  235/ 249  94%  failed:   14

qemu_x86_long             tests/kernel/common/kernel.common                  FAILED: timeout
	see: sanity-out/qemu_x86_long/tests/kernel/common/kernel.common/handler.log

total complete:  236/ 249  94%  failed:   15

qemu_x86_long             tests/kernel/mem_protect/sys_sem/kernel.memory_protection.sys_sem FAILED: timeout
	see: sanity-out/qemu_x86_long/tests/kernel/mem_protect/sys_sem/kernel.memory_protection.sys_sem/handler.log

total complete:  241/ 249  96%  failed:   16

qemu_x86_long             tests/lib/json/libraries.encoding                  FAILED: timeout
	see: sanity-out/qemu_x86_long/tests/lib/json/libraries.encoding/handler.log

total complete:  242/ 249  97%  failed:   17

qemu_x86_long             tests/lib/mem_alloc/libraries.libc.minimal         FAILED: timeout
	see: sanity-out/qemu_x86_long/tests/lib/mem_alloc/libraries.libc.minimal/handler.log

total complete:  243/ 249  97%  failed:   18

qemu_x86_long             tests/shell/shell                                  FAILED: timeout
	see: sanity-out/qemu_x86_long/tests/shell/shell/handler.log

total complete:  244/ 249  97%  failed:   19

qemu_x86_long             tests/posix/fs/portability.posix                   FAILED: timeout
	see: sanity-out/qemu_x86_long/tests/posix/fs/portability.posix/handler.log

total complete:  245/ 249  98%  failed:   20

qemu_x86_long             tests/ztest/base/testing.ztest.verbose_0           FAILED: timeout
	see: sanity-out/qemu_x86_long/tests/ztest/base/testing.ztest.verbose_0/handler.log

total complete:  246/ 249  98%  failed:   21

qemu_x86_long             tests/ztest/custom_output/testing.ztest.customized_output FAILED: timeout
	see: sanity-out/qemu_x86_long/tests/ztest/custom_output/testing.ztest.customized_output/handler.log

total complete:  247/ 249  99%  failed:   22

qemu_x86_long             tests/ztest/custom_output/testing.ztest.regular_output FAILED: timeout
	see: sanity-out/qemu_x86_long/tests/ztest/custom_output/testing.ztest.regular_output/handler.log

total complete:  248/ 249  99%  failed:   23

qemu_x86_long             tests/ztest/mock/testing.ztest.mock                FAILED: timeout
	see: sanity-out/qemu_x86_long/tests/ztest/mock/testing.ztest.mock/handler.log

total complete:  249/ 249  100%  failed:   24
225 of 249 tests passed with 0 warnings in 168 seconds

@andrewboie
Contributor Author

A really good data point would be to see if this can be observed with nsim_hs_smp or some Xtensa target. I'm trying to globally enable SMP for nsim_hs_smp but I'm having some issues.

@andrewboie
Contributor Author

andrewboie commented Dec 18, 2019

Some other observations:

  • Applying the ztest patch to the current master branch also reproduces this on qemu_x86_64
  • All crashes involve the CPU trying to execute an address that isn't code. With the current x86_64 port this shows up as page faults due to our use of the NX bit; before the memory protection work landed I saw invalid opcodes, general protection faults, page faults, etc. Stack corruption, perhaps?
  • Without the ztest delay, the faulting address tends to be NULL or a small offset from it; with the delay, it's usually some RAM/rodata location
  • The crashes don't happen at the boundary of the added k_sleep() call

nashif added a commit to nashif/zephyr that referenced this issue Jan 15, 2020
We have some races causing random failures with this platform, set cpu
number to one while we investigate and fix the issue.

Related to zephyrproject-rtos#21317

Signed-off-by: Anas Nashif <anas.nashif@intel.com>
nashif added a commit that referenced this issue Jan 15, 2020
We have some races causing random failures with this platform, set cpu
number to one while we investigate and fix the issue.

Related to #21317

Signed-off-by: Anas Nashif <anas.nashif@intel.com>
@andyross
Contributor

OK, we're getting further. There's another, somewhat similar race involved in waiting for a z_swap() to complete. When swap begins, it does the scheduler work involved in re-queuing the _current thread (inside the scheduler lock, of course), and then it enters arch_switch() to do the actual context switch. But in the cycles between those two steps, the old/switching-from thread is in the queue and can be picked up and run on the other CPU, despite the fact that its registers won't be saved until somewhere in the middle of arch_switch()!

A PoC "fix" for x86_64 appears below. Basically it stuffs a magic cookie into the last field saved by arch_switch(), and has the scheduler spin until that cookie is clobbered (i.e., until the register save has completed) before handing the thread to another CPU. Applying that results in an almost 100% reliable sanitycheck run (well under one failure per run, anyway, which is around the threshold where measurement gets confounded by the known timing slop).

Now I just need to find a way to do this portably and simply without putting too many weird requirements on the architecture layer... (a rough sketch of the shape I have in mind follows the patch below).

diff --git a/arch/x86/core/intel64/locore.S b/arch/x86/core/intel64/locore.S
index e38a0fa993..5140c3a0e8 100644
--- a/arch/x86/core/intel64/locore.S
+++ b/arch/x86/core/intel64/locore.S
@@ -212,7 +212,6 @@ z_x86_switch:
        movq %r12, _thread_offset_to_r12(%rsi)
        movq %r13, _thread_offset_to_r13(%rsi)
        movq %r14, _thread_offset_to_r14(%rsi)
-       movq %r15, _thread_offset_to_r15(%rsi)
 #ifdef CONFIG_USERSPACE
        /* We're always in supervisor mode if we get here, the other case
         * is when __resume is invoked from irq_dispatch
@@ -220,6 +219,8 @@ z_x86_switch:
        movq $X86_KERNEL_CS, _thread_offset_to_cs(%rsi)
        movq $X86_KERNEL_DS, _thread_offset_to_ss(%rsi)
 #endif
+       // HACK: move R15 to the end so it's the last value saved
+       movq %r15, _thread_offset_to_r15(%rsi)
        movq %gs:__x86_tss64_t_ist1_OFFSET, %rsp
 
        /* fall through to __resume */
diff --git a/kernel/include/kswap.h b/kernel/include/kswap.h
index 3537a24d21..e98a6cf609 100644
--- a/kernel/include/kswap.h
+++ b/kernel/include/kswap.h
@@ -57,8 +57,19 @@ static ALWAYS_INLINE unsigned int do_swap(unsigned int key,
                k_spin_release(lock);
        }
 
+       // HACK: poison the last value saved in arch_switch, so we can
+       // tell when it's done (i.e. when this magic number gets
+       // clobbered)
+       u64_t old_r15 = old_thread->callee_saved.r15;
+       old_thread->callee_saved.r15 = 0xff55aa11;
+
        new_thread = z_get_next_ready_thread();
 
+       if (new_thread == old_thread) {
+               // (HACK: not switching, put it back)
+               _current->callee_saved.r15 = old_r15;
+       }
+
        if (new_thread != old_thread) {
                sys_trace_thread_switched_out();
 #ifdef CONFIG_TIMESLICING
diff --git a/kernel/sched.c b/kernel/sched.c
index 2ebc9d4be9..007fc5bab4 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -248,6 +248,12 @@ static ALWAYS_INLINE struct k_thread *next_up(void)
        }
        z_mark_thread_as_not_queued(thread);
 
+       // HACK: spin for an incoming switch
+       if (thread != _current) {
+               volatile u64_t *xx = &thread->callee_saved.r15;
+               while (*xx == 0xff55aa11);
+       }
+
        return thread;
 #endif
 }
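
To sketch the portable shape (purely illustrative; the helper name and protocol below are not an existing API): each arch would publish a per-thread switch handle as the very last store in arch_switch(), after all registers are saved, and the scheduler would spin until that store is visible before letting another CPU run the thread. Something like:

/* Illustrative sketch only: assumes do_swap() clears the outgoing
 * thread's switch_handle to NULL before re-queuing it, and that
 * arch_switch() writes a non-NULL value back as its final store once
 * the register save is complete.
 */
#include <kernel.h>

static inline void wait_for_switch_done(struct k_thread *thread)
{
	void * volatile *handle = &thread->switch_handle;

	/* Spin until the other CPU has finished saving this thread's
	 * registers; only then is it safe to resume it here.
	 */
	while (*handle == NULL) {
	}
}

The scheduler would call this from next_up() (the same place the r15 hack above spins), and the arch side would need the matching store plus ordering guarantees, but that's the gist.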

@andyross
Contributor

(applying that on top of #21903 that is)

@andyross
Contributor

OK, #21903 now has both fixes (and a reversion of the CPU count workaround) and seems much improved to me.

I'm still seeing occasional failures in samples/userspace/prod_consumer, samples/userspace/shared_mem and tests/kernel/common. A quick check running prod_consumer in isolation on an unloaded system has shown 6 failures in 44 runs, all of which seem to be detected as a "no data?" error by the test before a crash (not sure if the crash is the bug or just an effect of the edge case).

Still investigating, but I'd suggest reviewing and merging those fixes ASAP.

@andyross
Contributor

Well, that one was much quicker: a tiny one-line SMP race in the queue implementation was the source of the prod_consumer failures. Added to the series. The others might have gotten a little better; hard to say.

@andrewboie
Contributor Author

The major issues here are now fixed and we've re-enabled SMP in QEMU.

One of the races was fixed with a delay, but I requested we leave this ticket open until IPI can be re-worked to make it bulletproof as described here: #21903 (comment)

@andrewboie
Contributor Author

andrewboie commented Jan 22, 2020

I'm still seeing some instabilities in the latest code after merging #21903.

The first is a problem in tests/kernel/sched/schedule_api, where test_user_k_is_preempt fails the second k_thread_create() syscall with "k_thread in use". The test calls k_thread_abort() on a thread object and then immediately calls k_thread_create() on the same object. Either k_thread_abort() needs to block until the target thread has actually exited, or we need a new blocking kernel API to wait for a thread to exit. This doesn't always fail, but the explanation of the failure seems straightforward; a sketch of the pattern is below.
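
To illustrate the pattern and the kind of API I'm asking for (the wait call in the comment is hypothetical and the names are made up; ztest plumbing omitted):

#include <kernel.h>

#define CHILD_STACK_SIZE 1024

K_THREAD_STACK_DEFINE(child_stack, CHILD_STACK_SIZE);
static struct k_thread child_thread;

static void child_entry(void *p1, void *p2, void *p3)
{
	/* ... test body ... */
}

void reuse_thread_object(void)
{
	k_tid_t tid = k_thread_create(&child_thread, child_stack,
				      K_THREAD_STACK_SIZEOF(child_stack),
				      child_entry, NULL, NULL, NULL,
				      K_PRIO_PREEMPT(1), 0, K_NO_WAIT);

	k_thread_abort(tid);

	/* On SMP the aborted thread may still be mid-switch on the other
	 * CPU at this point, so reusing the object immediately can fail
	 * with "k_thread in use".  A blocking wait-for-exit call would
	 * close the window, e.g. (hypothetical, does not exist today):
	 *
	 *     k_thread_wait_exited(tid, K_FOREVER);
	 */

	k_thread_create(&child_thread, child_stack,
			K_THREAD_STACK_SIZEOF(child_stack),
			child_entry, NULL, NULL, NULL,
			K_PRIO_PREEMPT(1), 0, K_NO_WAIT);
}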

The second is that samples/userspace/shared_mem still appears to make the emulator exit. I am probably going to have to instrument QEMU itself to produce a debug dump when it triple faults in order to learn more. This happens a lot.

@andrewboie
Contributor Author

Saw a similar "k_thread in use" error in tests/kernel/mem_protect/sys_sem, this time from re-using a thread object from another test case. This looks like a test case issue that can only be solved by some kind of wait() API:

  1. test_sem_take_timeout creates a child thread, and then blocks on a semaphore that gets given at the end of the child thread's entry function
  2. The next test case, test_sem_take_timeout_forever, re-uses the thread object; there is a race between the child thread actually exiting and the call that re-uses the object

Here there is no explicit k_thread_abort() call; the code is using the semaphore to determine that the thread is finished...but it really isn't. Unfortunately I think a fair number of tests do this (see the sketch below).
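
Roughly, the racy pattern looks like this (simplified; names approximate the test, ztest plumbing omitted):

#include <kernel.h>

#define STACK_SIZE 1024

K_THREAD_STACK_DEFINE(tstack, STACK_SIZE);
static struct k_thread tdata;
K_SEM_DEFINE(done_sem, 0, 1);

static void child_entry(void *p1, void *p2, void *p3)
{
	/* ... exercise the semaphore under test ... */
	k_sem_give(&done_sem);
	/* The thread is still live past this point, until it returns and
	 * is actually switched out on its CPU.
	 */
}

static void test_sem_take_timeout(void)
{
	k_thread_create(&tdata, tstack, K_THREAD_STACK_SIZEOF(tstack),
			child_entry, NULL, NULL, NULL,
			K_PRIO_PREEMPT(1), 0, K_NO_WAIT);
	k_sem_take(&done_sem, K_FOREVER);
	/* On SMP, &tdata may still be marked "in use" here. */
}

static void test_sem_take_timeout_forever(void)
{
	/* Can fail with "k_thread in use" right after the previous test
	 * case: the semaphore only proves the child reached k_sem_give(),
	 * not that it has finished exiting.
	 */
	k_thread_create(&tdata, tstack, K_THREAD_STACK_SIZEOF(tstack),
			child_entry, NULL, NULL, NULL,
			K_PRIO_PREEMPT(1), 0, K_NO_WAIT);
}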

@andrewboie
Contributor Author

Anecdotally, the frequency of timing-related QEMU failures seems to have increased, although I don't have concrete data, or even know whether this affects platforms outside of x86_64.

I've observed the following weird crashes in test runs this morning. I have seen each of these only once:

$ cat crashes.txt 
tests/kernel/mem_pool:

Running test suite test_sys_mem_pool_api
===================================================================
starting test - test_sys_mem_pool_alloc_free
PASS - test_sys_mem_pool_alloc_free
===================================================================
starting test - test_sys_mem_pool_alloc_align4
PASS - test_sys_mem_pool_alloc_align4
===================================================================
starting test - test_sys_mem_pool_min_block_size
E: General protection fault (code 0x0)
00000000100515 RD000000000 RBX: 0x00000000001120a0 RCX: 0x00
X: 0x0000000000    Assertion faile0000d at ZEPHYR_21
E: RSI: 0x00000000001120f8 RDI: 0x00000000001120c8 RBP: 0x00000000001120c8 RSP: 0x0000000000103fab
E:  R8: 0x0000000000000002  R9: 0x0000000000000002 R10: 0x0000000000000000 R11: 0x00000000000BASE/te6
E: R12: 0x0000000000000000 R13: 0x0000000000000000 R14: 0x0000000000000000 R15: 0x0000000000000000
E: RSP: 0x00000000em_pool/src/main.c:11200103fa: b RFtest_LAGS: 0x0000000000000206 CS: 0x003b CsysR30
_mem_pool_min_block_siEze: blo: ckRI[P: 0x025305c7fb894853
E: >>> ZEPHYR i] isFATAL ERROR  NULL
0: CPU exception on CPU 0

E: Current thread: 0x00000000001283c0 (unknown)
E: Halting system



tests/lib/fdtable:

===================================================================
starting test - test_z_finalize_fd
PASS - test_z_finalize_fd
===================================================================
starting test - test_z_alloc_fd
PASS - test_z_alloc_fd
===================================================================
starting test - test_z_free_fd
ASSERTION FAIL Recursive spinlock 0x	fee30
E: RAX: 0x000000000000011bd10z_spi00000000n_unlock_valid(l)00004 RBX: 0x000000000011ae80 RCX: 0x00008
E: RSI: 0x000000000000004a RDI: 0x0000000000107783 RBP: 0x0000000000000202 RSP: 0x0000000000126f28
E:  R8: 0x0000000000000005  R9: 0x0000000000000000 R10: 0x0000000000000000 R11: 0x0000000000000000
E: R12: 0x0000000000107670 R13: 0x0000000000000000 R14: 0x00000000fPASS - test_z_free_fd
fffffff R15: 0x0000000000000000
===================================================================
E: RSP: 0x0000000000126f28 RFLTest suite AGStest_fdtable succeeded
: 0x0000000000000206 CS: 0x0018 CR3======================================================: 0x0000000=
PROJECT EXECUTION SUC
CESSFUL
E: RIP: 0x0000000000100dfd
E: >>> ZEPHYR FATAL ERROR 4: Kernel panic on CPU 1
E: Current thread: 0x000000000011b8a0 (main)
E: Halting system


tests/kernel/common: (this one is baffling)

starting test - test_byteorder_memcpy_swap

    Assertion failed at ZEPHYR_BASE/tests/kernel/common/src/byteorder.c:
    Assertion38failed at t stebyteorder_memcpy_swap:1132746: Swap memcpy failed: _
Swap memcpy failed
FAIL - test_byteorder_memcpy_swap

@andrewboie
Contributor Author

andrewboie commented Jan 22, 2020

The issue with samples/userspace/shared_mem is not a triple fault. It's a sample, which doesn't enable CONFIG_TEST, which in turn doesn't enable CONFIG_LOG, which is needed to see fatal errors. I turned logging on by adding CONFIG_LOG=y and CONFIG_LOG_MINIMAL=y to prj.conf and got more information.
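
For reference, the two lines added to the sample's prj.conf:

# enable (minimal) logging so fatal error dumps are printed
CONFIG_LOG=y
CONFIG_LOG_MINIMAL=y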

Running this sample in a terminal, I am able to reproduce the problem outside of sanitycheck, although it can take several iterations before it crashes, and sometimes it just hangs instead of crashing:

PT Sending Message 1
ENC Thread Received Data
ENC PT MSG: PT: message to encrypt

CT Thread Receivedd Message
CT MSG: ofttbhfspgmeqzos

PT Sending Message 1'
ENC Thread Received Data
ENC PT MSG: ofttbhfspgmeqzos

CT Thread Receivedd Message
CT MSG: messagetoencrypt
E: Page fault at address 0xE: Page fault at address 0x104cb2 (error code 0x3)
10 (error code 0x10)
E: Access violation: supervisor thread not allowed to write
E: Linear address not present in page tables
E: PME: PL4E: 0xML4E: 0x0000000000136827 Writable, User, 0000000000136827 Writable, User, ExecExecutd
ute Enabled
E: PDPTE: 0x0000000000137827 Writable, User, Execute Enabled
E: PDPTE: 0x0000000000137827 Writable, User, Execute Enabled
EE:   PDE: 0x0000000000138827 Writable, User, Execute Enabled
:   PDE: 0x0000000000138827 Writable, User, Execute Enabled
E:   PTE: 0x0000000000104005 Read-only, User, Execute Enabled
E:   PTE: Non-present
E: RAX: 0x000000000000000E: RAX: 0x00000000fb5b7308 RBX: 00 RBX: 0x001004afffffffff RCX: 0x0000000000
x0000000000104ca2 RCX: 0x00000000fffffE: RSI: 0x0000000000124860 RDI: 0x000000fff RDX: 0x000000000012
E: RSI: 0x0000000000000246 RDI: 0x0000000000000000 RBP: 0x001004afffffffff RSP: 0x000012490c RBP: 0x0
000000000013bd88
E:  R8: 0x0000000000000001  R9: 0x0000000000000018 R10: 0x00000000001058ca R11: 0x0000000000000000
E: R12: 0x0000000000000000 R13: 0x0000000000000000 R14: 0x0000000000000000 R15: 0x0000000000000000
E:  R8: 0x0000000000000046  R9: 0x000000000E: RSP: 0x000000000013bd88 RFLAGS: 0x0000000000000082 CS:0
00000000
E: R12: 0x0000000000000002 RE: RIP: 0x000000000010589b
13: 0x0000000000000000 R14: 0x0000000000000000 R15: 0x0000000000000000
E: RSP: 0x000000000013bda0 RFLAGS: 0x0000000000000247 CS: 0x0018 CR3: 0x0000000000111000
E: RIP: 0x0000000000000010
E: >>> ZEPHYR FATAL ERROR E: >>> ZEPHYR FATAL ERROR 0: CPU exception0: CPU exception on CPU 0
 on CPU 1
E: Current thread: 0x0000000000122700 (unknown)
E: Current thread: 0x00000E: Halting system
00000122700 (unknown)
E: Halting system
make[3]: *** [zephyr/CMakeFiles/run.dir/build.make:60: zephyr/CMakeFiles/run] Error 1
make[2]: *** [CMakeFiles/Makefile2:1329: zephyr/CMakeFiles/run.dir/all] Error 2
make[1]: *** [CMakeFiles/Makefile2:1336: zephyr/CMakeFiles/run.dir/rule] Error 2
make: *** [Makefile:521: run] Error 2

Two CPUs are crashing simultaneously, which is why the exception info is interleaved and a little garbled. But something really sticks out:

E: >>> ZEPHYR FATAL ERROR E: >>> ZEPHYR FATAL ERROR 0: CPU exception0: CPU exception on CPU 0
 on CPU 1
E: Current thread: 0x0000000000122700 (unknown)
E: Current thread: 0x00000E: Halting system
00000122700 (unknown)
E: Halting system

How is it that the same thread object at 0x122700 is running on two CPUs at the same time?

@andrewboie andrewboie changed the title from "intermittent crashes on x86_64" to "intermittent SMP crashes on x86_64" Jan 23, 2020
@13824125580

In the do_swap() function, what happens if CPU 1 accesses the ready queue before CPU 0 has done the actual context-save operations?
It seems that no global lock is used in do_swap() to guarantee that the ready queue is modified by only one core at a time.

@nashif
Member

nashif commented Feb 13, 2020

addressed
