RFC: catch writes to protected memory in mmapped arrays (fix #3434) #6877

Merged into master Jul 23, 2014

Conversation

@timholy (Member) commented May 18, 2014

Starting from the example in #6703,

julia> s = open("/tmp/a.bin", "r")
IOStream(<file /tmp/a.bin>)

julia> m = mmap_array(Float64, (10,1), s)
10x1 Array{Float64,2}:
 0.751866
 0.832087
 0.239773
 0.120287
 0.867469
 0.957427
 0.941344
 0.642588
 0.56932 
 0.464532

julia> m[3] = 2.0
ERROR: MemoryError()
 in setindex! at array.jl:305

julia> m = 5
5

julia> gc()

If desired I'd be happy to change the error message to something more descriptive (e.g., "Memory-access error. Did you try to set the value of a read-only mmapped array?"). I seem to remember some discussion about problems that can arise from more verbose error messages, which is why I started with something minimal.
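For context on why the assignment faults at all: a stream opened read-only ends up mapped without write permission, so any store into the array's pages makes the kernel deliver SIGSEGV. Below is a minimal, self-contained C illustration of that kind of read-only mapping; it is not the code in base/mmap.jl, and /tmp/a.bin is just a placeholder path.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    /* Open an existing file of Float64 values read-only. */
    int fd = open("/tmp/a.bin", O_RDONLY);
    if (fd == -1) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) == -1) { perror("fstat"); return 1; }

    /* PROT_READ only: this mirrors mapping a stream opened with "r". */
    double *m = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (m == MAP_FAILED) { perror("mmap"); return 1; }

    printf("m[2] = %g\n", m[2]);  /* reads work fine */
    /* m[2] = 2.0;  <-- a store like this faults: the pages lack
       PROT_WRITE, so the kernel raises SIGSEGV. */

    munmap(m, st.st_size);
    close(fd);
    return 0;
}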

Summary of design: store the begin/end pointers of each write-protected block of memory; when a SIGSEGV arrives, check whether the faulting address falls within one of these protected ranges. If it does, raise the MemoryError shown above; if it doesn't, fall back to the default SIGSEGV handling.
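A rough sketch of that design follows (standalone POSIX code, not the patch that actually landed in src/init.c; here the handler just reports the bad write and exits, whereas the real implementation raises the MemoryError shown above):

#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Registry of write-protected [begin, end) address ranges. */
#define MAX_PROTECTED 64
static struct { uintptr_t beg, end; } protected_ranges[MAX_PROTECTED];
static int nprotected = 0;

void register_protected(void *p, size_t len)
{
    if (nprotected < MAX_PROTECTED) {
        protected_ranges[nprotected].beg = (uintptr_t)p;
        protected_ranges[nprotected].end = (uintptr_t)p + len;
        nprotected++;
    }
}

static void segv_handler(int sig, siginfo_t *info, void *context)
{
    (void)sig; (void)context;
    uintptr_t addr = (uintptr_t)info->si_addr;
    for (int i = 0; i < nprotected; i++) {
        if (addr >= protected_ranges[i].beg && addr < protected_ranges[i].end) {
            /* Faulting address lies inside a registered read-only mapping:
               this is where the real code would throw a Julia MemoryError. */
            const char msg[] = "write to read-only mmapped memory\n";
            write(STDERR_FILENO, msg, sizeof(msg) - 1);  /* async-signal-safe */
            _exit(1);
        }
    }
    /* Not one of ours: restore the default action and re-raise, so the
       usual SIGSEGV behavior still applies. */
    signal(SIGSEGV, SIG_DFL);
    raise(SIGSEGV);
}

void install_segv_handler(void)
{
    struct sigaction act;
    memset(&act, 0, sizeof(act));
    act.sa_sigaction = segv_handler;
    act.sa_flags = SA_SIGINFO;
    sigemptyset(&act.sa_mask);
    sigaction(SIGSEGV, &act, NULL);
}

In this sketch the mapping code would call register_protected on the mapped range right after mmap returns; register_protected and install_segv_handler are hypothetical helpers for illustration, not functions that exist in the Julia sources.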

I can't test this on Windows or OSX (and IIUC the AppVeyor testing is still not finished?), but there's relatively little of this that's OS-specific.

@tkelman (Contributor) commented May 18, 2014

IIUC the AppVeyor testing is still not finished?

It's more or less finished (I do need to tweak the binary download location), but it runs so slowly that only a tiny fraction of the tests can finish within the time limit.

The base test coverage for mmap isn't catching a lot of problems that are occurring on Windows in various packages - HDF5, Requests, possibly others. I've been too busy to hunt down the problems and come up with reduced test cases.

@timholy changed the title from "Catch writes to protected memory in mmapped arrays (fix #3434)" to "RFC: catch writes to protected memory in mmapped arrays (fix #3434)" on May 19, 2014
@timholy (Member, Author) commented May 19, 2014

Updated to incorporate Keno's suggestion. The net effect is no change to base/mmap.jl and just a few lines in src/init.c. In case anyone else likes the original better, I haven't squashed or force-pushed yet; barring objections to the new implementation, I'll do that before merging.

@timholy (Member, Author) commented May 26, 2014

Closed by caa4bfa

@timholy closed this May 26, 2014
@timholy deleted the teh/mmapsegfault branch May 26, 2014 22:58
@timholy restored the teh/mmapsegfault branch May 26, 2014 23:15
@timholy (Member, Author) commented May 26, 2014

Hmm, this may be causing Travis failures: the failures people are seeing (#6978) occur in the same process that runs the file test. What's really odd is that (1) naturally, I re-ran all the tests on my machine and they passed, and (2) the commit I pushed was just a rebase and squash of two commits, both of which passed Travis. I even restored this branch and re-ran the Travis builds from back when they were two separate commits, just to check, and they are passing.

So this may not be the commit that is triggering the failures, but I am suspicious. Gotta run to dinner, but will check back on this later this evening.

@timholy (Member, Author) commented May 28, 2014

More debugging. I took those same commits and re-committed them (without squashing) on top of a recent master. Even though the identical commits passed above, I'm now seeing test failures. This suggests either that something changed in base Julia, or that the failure is intermittent. EDIT: I should have pointed out that different Travis runs seem to fail for different reasons; in the linked example, one failed on the file test, while the other completed the file test successfully and failed in the next test run in the same process.

Intermittent failure might suggest a memory problem, so I ran the test through valgrind (wow, is that slow). The only oddity compared to commenting out the

@test_throws MemoryError c[1] = 0   # read-only memory

line is

==6754== Warning: client switching stacks?  SP change: 0x9dacc10 --> 0x7feffe600
==6754==          to suppress, use: --max-stackframe=34177620464 or greater

That error message is (sort of) explained here.

I then wondered if this might also happen with our stack overflow detector. Sure enough, with bde82ca I got the same type of valgrind warning. That commit passed its Travis test, but on my own machine I had variable outcomes: several times it ran properly (for example, the time I ran it through valgrind), but once it managed to crash/reboot my entire machine, and it looked like it was headed that way a second time before I killed the process. So we may still have some issues with our SEGV catching.

I'm afraid due to other commitments I need to stop working on this at least until the 2nd week of June, but I couldn't let this rest entirely without additional investigation.

@timholy (Member, Author) commented Jul 8, 2014

OK, so a temporary account on Travis has let me test this more easily. Managed to trap a segfault:

/tmp/julia/share/julia/test$ gdb /tmp/julia/bin/julia-debug
GNU gdb (Ubuntu/Linaro 7.4-2012.04-0ubuntu2.1) 7.4-2012.04
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
<http://bugs.launchpad.net/gdb-linaro/>...
Reading symbols from /tmp/julia/bin/julia-debug...done.
(gdb) run runtests.jl file
Starting program: /tmp/julia/bin/julia-debug runtests.jl file
warning: no loadable sections found in added symbol-file system-supplied DSO at 0x7ffff7ffa000
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff28e7700 (LWP 19007)]
     * file
[New Thread 0x7fffeb82a700 (LWP 19010)]
[New Thread 0x7fffeae29700 (LWP 19011)]
[New Thread 0x7fffea428700 (LWP 19012)]
[New Thread 0x7fffe9a27700 (LWP 19013)]

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff51013f6 in ?? ()
(gdb) bt
#0  0x00007ffff51013f6 in ?? ()
#1  0x00007fffffffb8f0 in ?? ()
#2  0xe03fc30fb6401b00 in ?? ()
#3  0x00007fffffffb8b0 in ?? ()
#4  0x00007ffff51013c3 in ?? ()
#5  0x00007fffffffb8f0 in ?? ()
#6  0x00007ffff6ec2dfc in jl_apply (f=0x7ffff6efe1a4, args=0x7fffffffb8b0, 
    nargs=3762275087) at julia.h:975
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) 

Due to the stack corruption, even frame 6 looks rather useless (nargs=3762275087 is a lot of arguments!). Note that the file test here includes the

@test_throws MemoryError c[1] = 0  # read-only memory

line, which is what (intermittently) triggers this segfault.

One option I see is to commit this patch without including the test, so that at least we trap the error. That seems like an improvement over where we are now. We could expand the error message to say "Your session is probably corrupt, a restart is recommended."

As a point of reference, we've never had a test for the ability to catch stack overflows, either. When I add one, I intermittently get a similar error to what I'm experiencing here. So this plan would simply put the two on par.

@timholy (Member, Author) commented Jul 23, 2014

Bump. #7708 reminded me that stack overflows are probably not in any better shape.

With this, we should at least usually get an error; if that error tells the user why it happened, and that Julia probably needs to be restarted due to stack corruption, that seems like an improvement over a bare segfault.

If we want this for 0.3, it should be merged soon. Or just wait for 0.4. Let me know, and I'll remove the test, squash, and merge.

@JeffBezanson (Member)

Yes, let's remove the test and merge.

@timholy merged commit d4f8c31 into master Jul 23, 2014
@timholy deleted the teh/mmapsegfault branch July 24, 2014 00:05
@timholy (Member, Author) commented Jul 24, 2014

Done.

@timholy mentioned this pull request May 19, 2015
@quinnj restored the teh/mmapsegfault branch May 19, 2015 18:52
@quinnj deleted the teh/mmapsegfault branch May 19, 2015 18:52
@quinnj (Member) commented May 19, 2015

Hmmm... why can't I see any commits/changes with this merge? Even after restoring the branch?

@quinnj (Member) commented May 19, 2015

NM, found it

@tkelman (Contributor) commented Jun 13, 2015

GitHub (and/or Jacob) did something funky here. d4f8c31 does not look like the right commit; can someone find the right commit for this?

@yuyichao (Contributor)

@tkelman caa4bfa ? #6877 (comment) ?

@yuyichao (Contributor)

Or 3b36d85

These look like the only relevant commits from

git log --before='Jul 23 2014' --after='May 18 2014' --author=holy

@tkelman (Contributor) commented Jun 13, 2015

Ah, thanks, yes, git log --grep=3434 helps: that first one was reverted in 87c3e38, then in July it was committed again, but without the test.
