Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Valgrind throwing lots of errors #15

Closed
critiqjo opened this issue Dec 15, 2015 · 19 comments · Fixed by #43
Closed

Valgrind throwing lots of errors #15

critiqjo opened this issue Dec 15, 2015 · 19 comments · Fixed by #43

Comments

@critiqjo
Copy link

When the number of coroutines spawned is greater than a certain value, a lot of "use of uninitialized value" errors are thrown by valgrind.

Errors are thrown only when 15 or more coroutines are spawned:

extern crate coio;

use coio::{Scheduler, sleep_ms};

fn main() {
    Scheduler::new()
        .with_workers(1)
        .run(|| {
            for _ in 0..15 {
                Scheduler::spawn(move || {
                    println!("Heil Hydra");
                    sleep_ms(100);
                });
            }
        })
        .unwrap();
}

If there are more workers, I can spawn more coroutines without errors, but I do reach a limit (eg. 19 for 2 workers). Despite these errors, the program gives the expected output, and exits without any segfaults.

Here are a few lines (out of 11901) from the log:

...
==13148== Warning: client switching stacks?  SP change: 0x5f134e0 --> 0x407bff8
==13148==          to suppress, use: --max-stackframe=32077032 or greater
==13148==          further instances of this message will not be shown.
==13148== Thread 2 Worker 0:
...
==13148== Invalid read of size 8
==13148==    at 0x1451AF: runtime::processor::_$LT$impl$GT$::running::hbdb031c50523a379rtc (processor.rs:190)
==13148==  Address 0x5f13f58 is in a rw- anonymous segment
==13148== 
==13148== Invalid read of size 8
==13148==    at 0x1451B9: runtime::processor::_$LT$impl$GT$::running::hbdb031c50523a379rtc (processor.rs:190)
==13148==  Address 0x5f13f60 is in a rw- anonymous segment
==13148== 
==13148== Invalid read of size 8
==13148==    at 0x1451EA: _$LT$impl$GT$::insert::insert::h14411986088127779969 (lib.rs:133)
==13148==  Address 0x5f14028 is in a rw- anonymous segment
==13148== 
...
==13148== Conditional jump or move depends on uninitialised value(s)
==13148==    at 0x56E26EF: pthread_join (in /usr/lib/libpthread-2.22.so)
==13148==    by 0x140862: thread::_$LT$impl$GT$::join::join::h17510934181584830106 (mod.rs:574)
==13148==    by 0x1407DE: thread::_$LT$impl$GT$::join::join::h10846901181898427765 (mod.rs:604)
==13148==    by 0x113AA5: scheduler::_$LT$impl$GT$::run::run::h12081195173303708932 (scheduler.rs:210)
==13148==    by 0x11219A: main::h28feaf366a048100iaa (main.rs:6)
==13148==    by 0x196324: sys_common::unwind::try::try_fn::h10038815114197766760 (in /home/john/dev/coio-test/target/debug/coio-test)
==13148==    by 0x193B78: __rust_try (in /home/john/dev/coio-test/target/debug/coio-test)
==13148==    by 0x195FC6: rt::lang_start::hbdf4b213a64b8c7394x (in /home/john/dev/coio-test/target/debug/coio-test)
==13148==    by 0x146DE9: main (in /home/john/dev/coio-test/target/debug/coio-test)
==13148== 
...
@zonyitoo
Copy link
Owner

Ah, it seems that nearly every call in this crate will generate a "Invalid read" or "Use of uninitialised value".

Could you please first try to use valgrind to test this simple example program?
https://github.com/zonyitoo/context-rs/blob/master/examples/simple.rs

Let's see whether the problem is in the context-rs crate.

@critiqjo
Copy link
Author

Could you please first try to use valgrind to test this simple example program?

Both simple.rs and asymmetric.rs runs fine without errors. But I think that should be expected since the errors are observed only when there are 15 or more coroutines in the coio example.

I also tried out a slightly modified asymemetric.rs example:

fn main() {
    let mut coros: Vec<Coroutine<i32>> = (0..20).map(|_| Coroutine::spawn(|me| {
        for num in 0..2 {
            me.yield_with(num);
        }
    })).collect();

    for num in 0..3 {
        for coro in coros.iter_mut() {
            println!("{:?}", coro.next());
        }
    }
}

which also runs without errors...

@zonyitoo
Copy link
Owner

That is weird. Is it possible that because of the user space context switching confuses the Valgrind?

@critiqjo
Copy link
Author

I suppose valgrind should be smarter than that! But I'm not experienced in either domains, so can't say for sure!

@zonyitoo
Copy link
Owner

I am now focusing on refactoring the I/O module, so I will be happy if you can help to debug :)

@critiqjo
Copy link
Author

I will be happy to help. But, currently, I have no idea how context is implemented. Where do you suggest I should look to get a high-level view of the design, before jumping deep? (never mind! :) your documentation/comments is pretty extensive!)

@critiqjo
Copy link
Author

I ran it again with --track-origins=yes option (info source). And now, most of the errors are followed by these 2 lines:

==27720==  Uninitialised value was created by a stack allocation
==27720==    at 0x174D91: ??? (_context.S:121)

which is this statement:

mov (RUSTRT_RSP*8)(ARG1), %rsp

And a very few others such as:

==27720==  Uninitialised value was created by a stack allocation
==27720==    at 0x140DE4: scheduler::_$LT$impl$GT$::spawn_opts::spawn_opts::h12050465489478016694 (scheduler.rs:140)
...
==27720==  Uninitialised value was created by a stack allocation
==27720==    at 0x187540: sync::once::_$LT$impl$GT$::call_once::call_once::h13033172402953422293 (once.rs:72)
...
==27720==  Uninitialised value was created by a stack allocation
==27720==    at 0x1413F0: sync::mpsc::channel::channel::h3345162500116368437 (mpsc.rs:108)

The last two are not even related to context.

@zonyitoo
Copy link
Owner

mov (RUSTRT_RSP*8)(ARG1), %rsp

This line is used for restoring the %rsp register for swapping stacks. I can't see any obvious mistakes in this piece of code.

scheduler::_$LT$impl$GT$::spawn_opts::spawn_opts is called when spawning new coroutines, it is in pure Rust code. In this function, the spawn_opts, it will call the sync::mpsc::channel::channel.
The sync::once::_$LT$impl$GT$::call_once::call_once call may be related to the thread_local! macro, which used to store the Processor object and called by spawn_opt.

So these three errors are all related. But I don't know which one is uninitialised.

@critiqjo
Copy link
Author

Another observation: if I reduce the DEFAULT_STACK to 64 * 1024 (instead of 128kB), the number of coroutines can go up to 30 without errors (doubled!). So I doubt it has something to do with the stack pool? (or not, since the modified asymmetric example used more much more stacks-allocations: 2MB x 20)

@zonyitoo
Copy link
Owner

Why not try to disable the stack cache.

https://github.com/zonyitoo/context-rs/blob/master/src/stack.rs#L157

Just try to set RUST_MAX_CACHED_STACKS=0, so it will call mmap to create a new stack on every spawns.

@critiqjo
Copy link
Author

Why not try to disable the stack cache.

Unfortunately, increasing it or disabling it made no difference.

The total stack size without "errors" is just nearly 2 MB; or more precisely, 2 "stacks" less than 2 MB for 256 KB (6 coroutines without errors), 128 KB (14) and 64 KB (30) stack sizes; but 3 "stacks" less for 32 KB (61) stack size.

@zonyitoo
Copy link
Owner

Well, I have no idea what happened right now. Help wanted.

@lhecker
Copy link
Collaborator

lhecker commented Jan 4, 2016

@critiqjo @zonyitoo Did you guys use debug or release builds while testing this with valgrind? In the latter case this might be the issue: rust-lang/rust#5856 (comment)

@zonyitoo
Copy link
Owner

zonyitoo commented Jan 5, 2016

If you use release builds for testing, you would find no symbols in the output logs...

Or you could use --opt-level=3 with -g

@critiqjo
Copy link
Author

critiqjo commented Jan 9, 2016

Sorry for the late reply. I'm not able to test the original example in the machine I use now -- panics on sleep_ms call (backtrace).

Though, if I remember correctly, I got those valgrind errors for both debug and release builds, and I definitely didn't use --opt-level=3 and -g together.

@zonyitoo
Copy link
Owner

zonyitoo commented Jan 9, 2016

panicked at 'Box<Any>' should reason of triggering force unwinding .. I don't know exactly why it happens.

@zonyitoo
Copy link
Owner

In the latest commit in feature-migrate_context1.0 branch, valgrind doesn't show any Unitialized value or Invalid Read or Invalid Write logs. Including the test program in the top of this issue.

@lhecker
Copy link
Collaborator

lhecker commented Feb 13, 2016

Well that's kinda unexpected but still very very nice. 👍

@zonyitoo
Copy link
Owner

I think the previous logs about Invalid Reads and Invalid Writes are actually exist because we write to some variables allocated on the stack of the Coroutine after we dropped the stack, as described in that issue. But I think the problem is already solved during the migration to context-rs 1.0.

@zonyitoo zonyitoo mentioned this issue Feb 15, 2016
8 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants