Bogus data being marked for GC sweeps under Rubinius #1047

Closed · yorickpeterse opened this issue Feb 12, 2014 · 25 comments
Labels: topic/memory (Segfaults, memory leaks, valgrind testing, etc.)

@yorickpeterse

Problem: in certain cases Nokogiri appears to be marking bogus data for GC sweeps (using the mark function found at https://github.com/sparklemotion/nokogiri/blob/master/ext/nokogiri/xml_node.c#L15). Under MRI this seems to magically work, but under Rubinius it triggers a segfault.

A script that reproduces this is here: https://gist.github.com/YorickPeterse/29bdbab31c0cabdc66b2. When running it, make sure you're using Rbx 2.2.5 and Nokogiri built from Git source.
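For reference, the mark callback in question looks roughly like the sketch below. This is a reconstruction based on the line numbers visible in the gdb output and backtraces that follow (xml_node.c:18 and xml_node.c:23), not a verbatim copy of the source linked above:

/* Sketch of Nokogiri's mark callback for xmlNodePtr Data objects,
 * reconstructed from the gdb output below. */
static void mark(xmlNodePtr node)
{
  xmlNodePtr doc = node->doc;    /* doc is 0x0 in the gdb session below */
  if (doc->type == XML_DOCUMENT_NODE || doc->type == XML_HTML_DOCUMENT_NODE) {  /* xml_node.c:18 */
    if (DOC_RUBY_OBJECT_TEST(doc)) {
      rb_gc_mark(DOC_RUBY_OBJECT(doc));
    }
  } else if (doc->_private) {
    rb_gc_mark((VALUE)doc->_private);   /* xml_node.c:23 in the later Baker backtrace */
  }
}

In the gdb session below the node being marked contains garbage and its doc pointer is NULL, so the doc->type check itself faults; in the Baker backtrace further down, the fault happens inside rb_gc_mark, suggesting the value being marked no longer refers to a live object.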

Observations so far:

Samples from gdb:

(gdb) frame
#0  0x00007ffff63248cc in mark (node=0x25247e0) at xml_node.c:18
18        if(doc->type == XML_DOCUMENT_NODE || doc->type == XML_HTML_DOCUMENT_NODE) {
(gdb) info locals
doc = 0x0
(gdb) info args
node = 0x25247e0
(gdb) p node
$5 = (xmlNodePtr) 0x25247e0
(gdb) p doc
$6 = (xmlNodePtr) 0x0
(gdb) p *node
$7 = {_private = 0x0, type = 0, name = 0x1f1f360 "", children = 0x0, last = 0x2524830, parent = 0x0, next = 0x0, prev = 0x0, doc = 0x0, ns = 0x21, 
  content = 0x79726f6765746163 <error: Cannot access memory at address 0x79726f6765746163>, properties = 0x7ffff66d8600 <__memalign_hook>, nsDef = 0x0, psvi = 0x81, line = 0, 
  extra = 0}
(gdb) p *doc
Cannot access memory at address 0x0
(gdb) 

The corresponding backtrace:

(gdb) bt
#0  0x00007ffff63248cc in mark (node=0x25247e0) at xml_node.c:18
#1  0x00000000006b2151 in rubinius::Data::Info::mark (this=<optimized out>, t=0x7fff689c8d48, mark=...) at vm/builtin/data.cpp:155
#2  0x000000000071ddb2 in rubinius::GarbageCollector::scan_object (this=0x1554a30, obj=0x7fff689c8d48) at vm/gc/gc.cpp:108
#3  0x000000000071fb0d in walk_pointers (this=<optimized out>, addr=..., mark=...) at /home/yorickpeterse/Private/Projects/ruby/rubinius/vm/gc/immix.hpp:78
#4  process_mark_stack (this=<optimized out>, alloc=..., count=<optimized out>) at /home/yorickpeterse/Private/Projects/ruby/rubinius/vm/util/immix.hpp:1095
#5  rubinius::ImmixGC::process_mark_stack (this=<optimized out>, count=<optimized out>, this=<optimized out>, count=<optimized out>) at vm/gc/immix.cpp:382
#6  0x00000000007212ce in rubinius::ImmixMarker::perform (this=0x1a96210, state=0x7fffefffeec0) at vm/gc/immix_marker.cpp:159
#7  0x0000000000721193 in rubinius::immix_marker_tramp (state=0x7fffefffeec0) at vm/gc/immix_marker.cpp:18
#8  0x00000000006f90fd in rubinius::Thread::in_new_thread (ptr=0x1a732d0) at vm/builtin/thread.cpp:259
#9  0x00007ffff72830a2 in start_thread () from /usr/lib/libpthread.so.0
#10 0x00007ffff641bd1d in clone () from /usr/lib/libc.so.6

I doubt this particular issue is caused by threading; more likely it is similar to the one described in #939. The use of threading simply increases the amount of GC activity, which in turn causes the issue to surface much more quickly.

@yorickpeterse (author)

Extra info: Rubinius has two GC systems in place, Immix and Baker. Baker is used for the young generation, Immix for the mature generation (if I'm not mistaken). In a production application the above issue mainly occurs in the Baker system, whereas in the script mentioned above it primarily occurs in the Immix system.

@yorickpeterse (author)

This paste (http://hastebin.com/raw/rolenakale) contains a sample of the backtraces of all threads at the time this issue occurred in a production application. The relevant bit is the following:

Thread 10 (Thread 0x7fffd8fec700 (LWP 18605)):
#0  rubinius::BakerGC::saw_object (this=0x1554880, obj=0x0) at vm/gc/baker.cpp:86
#1  0x000000000070af22 in rb_gc_mark (ptr=140736689064640) at vm/capi/gc.cpp:44
#2  0x00007ffff5e11932 in mark (node=0x7fffd04f4ca0) at xml_node.c:23
#3  0x00000000006b2151 in rubinius::Data::Info::mark (this=<optimized out>, t=0x7fffda3a7dd8, mark=...) at vm/builtin/data.cpp:155
#4  0x000000000071ddb2 in rubinius::GarbageCollector::scan_object (this=0x1554880, obj=0x7fffda3a7dd8) at vm/gc/gc.cpp:108
#5  0x000000000071a1f7 in rubinius::BakerGC::collect (this=0x1554880, data=0x7fffd8fccc48, stats=0x7fffd8fccc28) at vm/gc/baker.cpp:215
#6  0x0000000000656c31 in rubinius::ObjectMemory::collect_young (this=0x15543b0, state=0x7fffd8fcccf8, data=0x7fffd8fccc48, stats=0x7fffd8fccc28) at vm/object_memory.cpp:574
#7  0x00000000006569ad in rubinius::ObjectMemory::collect_maybe (this=0x15543b0, state=0x7fffd8fcccf8, gct=..., call_frame=<optimized out>) at vm/object_memory.cpp:531
#8  0x00000000006a1be5 in rubinius::VM::collect_maybe (this=0x1554880, gct=..., call_frame=0x719c90 <rubinius::BakerGC::saw_object(rubinius::Object*)>) at vm/vm.cpp:310
#9  0x00000000006add63 in collect_and_allocate (state=<optimized out>, gct=..., self=<optimized out>, self=<optimized out>, self=<optimized out>, self=<optimized out>, 
    self=<optimized out>, calling_environment=<optimized out>, state=<optimized out>, gct=..., self=<optimized out>, calling_environment=<optimized out>)
    at vm/builtin/class.cpp:66
#10 rubinius::(anonymous namespace)::allocate_packed (state=0x7fffc40008c0, gct=..., self=0x7fffd8fccd58, calling_environment=0x7fffd8fcd650) at vm/builtin/class.cpp:114

This shows that it occurs in both the young and the mature generation.

Corresponding Rubinius issue: rubinius/rubinius#2908

@yorickpeterse (author)

Backtraces of the repro script: http://hastebin.com/raw/pexacaladi

brixen added a commit to rubinius/rubinius that referenced this issue Feb 19, 2014
This indirectly fixes one of the issues exposed in the following:

sparklemotion/nokogiri#1047
#2844

This is not the root cause of the Nokogiri-related segvs, which
appear to be Nokogiri (libxml2) releasing data for a node while
a Data object wrapping it is still reachable and hence is having
mark() called on it.
@brixen commented Feb 21, 2014

I ran the repro script that @yorickpeterse provided, with one change: I upped the threads to 100.

I ran this on an Ubuntu 12.04 vbox machine on rubinius/rubinius@56d2edb with rubinius/rubinius@eba5d40 and rubinius/rubinius@6da3885 applied.

I built Rubinius and Nokogiri without optimizations and with debugging symbols enabled.

I instrumented Nokogiri (https://gist.github.com/brixen/9138421#file-nokogiri_1047-diff) to track where libxml2 pointers were being wrapped and where xmlFree* was being called. I also added instrumentation around the malloc, realloc, and free functions that Nokogiri is configuring libxml2 with.
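As a rough sketch of what the allocator-side instrumentation can look like (the wrapper names here are hypothetical; the actual changes are in the diff linked above), libxml2 lets you swap in your own allocators via xmlMemSetup:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <libxml/xmlmemory.h>

/* Hypothetical logging wrappers around the allocators handed to libxml2,
 * so every allocation and free of libxml2-owned memory shows up in a log. */
static void *logged_malloc(size_t size)
{
  void *ptr = malloc(size);
  fprintf(stderr, "alloc   %p (%zu bytes)\n", ptr, size);
  return ptr;
}

static void *logged_realloc(void *ptr, size_t size)
{
  void *new_ptr = realloc(ptr, size);
  fprintf(stderr, "realloc %p -> %p (%zu bytes)\n", ptr, new_ptr, size);
  return new_ptr;
}

static void logged_free(void *ptr)
{
  fprintf(stderr, "free    %p\n", ptr);
  free(ptr);
}

static char *logged_strdup(const char *str)
{
  char *copy = strdup(str);
  fprintf(stderr, "strdup  %p\n", (void *)copy);
  return copy;
}

/* Install the wrappers; xmlMemSetup takes free, malloc, realloc, strdup. */
static void install_logging_allocators(void)
{
  xmlMemSetup(logged_free, logged_malloc, logged_realloc, logged_strdup);
}

With something like this in place, the log records every alloc/free for memory that libxml2 owns, which is what makes patterns such as "free with no intervening alloc" (discussed in a later comment) visible.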

I instrumented Rubinius (https://gist.github.com/brixen/9138421#file-rbx-diff) to track Data objects being finalized.

I've run multiple times with the same results:

  1. A pointer is allocated
  2. The pointer is wrapped by Nokogiri
  3. The pointer is marked
  4. The pointer is free'd, with every indication being it is free'd by libxml2
  5. The Ruby object wrapping the pointer is still reachable, and an attempt to mark the pointer causes a SEGV

This gist is the output of running ack on the log file for the pointer that causes the fault: https://gist.github.com/brixen/9106890. I built the instrumentation up over time, but the same pattern is visible. This run is the last: https://gist.github.com/brixen/9106890#file-gistfile7-txt

The pointer causing the fault is a member of a NodeSet and is wrapped at this point in Nokogiri code: https://github.com/sparklemotion/nokogiri/blob/master/ext/nokogiri/xml_node_set.c#L268

I attempted to use xmlCopyNode, passing 1 for the extended parameter, but that function does not completely copy the node. In particular, the node->doc was 0x0 in the copy, as were parent, next, previous, IIRC.

The fundamental issue appears to be that Nokogiri is assuming the lifetime of the libxml2 object that it is wrapping bounds the lifetime of the Ruby object in which the pointer is wrapped. This assumption appears to be invalid in this particular case, which causes me to wonder how many other cases are invalid as well, but by some happy coincidence, usually work. If this is the case, Nokogiri is fundamentally unsound.
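To make that lifetime assumption concrete, the wrapping pattern under discussion is roughly the following. This is a schematic illustration, not Nokogiri's actual implementation; the class and helper names are placeholders:

#include <ruby.h>
#include <libxml/tree.h>

/* Schematic: the Ruby object wraps a raw xmlNodePtr whose memory is owned
 * and managed by libxml2, not by the Ruby GC. */
static void node_mark(void *data)
{
  xmlNodePtr node = (xmlNodePtr)data;
  /* If libxml2 has already freed this node (for example while freeing a
   * document tree or node set), this dereference reads freed memory and
   * can segfault, which matches the crashes in the backtraces above. */
  if (node->doc && node->doc->_private) {
    rb_gc_mark((VALUE)node->doc->_private);
  }
}

static VALUE wrap_node(VALUE klass, xmlNodePtr node)
{
  /* No free callback is registered for the node itself: the wrapper relies
   * on libxml2 keeping the node alive at least as long as the Ruby object. */
  return Data_Wrap_Struct(klass, node_mark, NULL, node);
}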

@brixen commented Feb 21, 2014

Also, I have all those logs if they are useful, but they are really big (each ~300MB gzipped).

brixen added a commit to rubinius/rubinius that referenced this issue Feb 22, 2014
This makes whatever the issue is in

sparklemotion/nokogiri#1047
#2844

not manifest during more than an hour of 100 threads running the
repro script in the Nokogiri ticket.

If a bug never appears, does it exist?
@brixen commented Feb 22, 2014

There are two scenarios that seem plausible.

Rubinius has a concurrent mark thread for the mature generation. It's possible that Nokogiri calls a libxml2 function that releases a tree, ultimately calling free on a node that the mark thread reaches after libxml2 calls free, but before Nokogiri finishes making the Ruby Data objects unreachable.

The other scenario is that Nokogiri is not thread safe in some respect.

Or it could be a combination of the two. Note that at one point in the logs, free is called 4 times on an address with no intervening allocs, and at another point, alloc appears to return the same address twice with no intervening free, after which free is called twice. The latter could be the result of interleaved output, but the former is hard to explain.

Running Rubinius with -Xgc.immix.concurrent=no, the issue does not manifest after more than 30 minutes of processing.

Ultimately, I added a C-API lock specifically for Nokogiri. This also appears to prevent the issue from manifesting over more than 60 minutes of processing.
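Conceptually, the effect of such a per-extension lock is that every Nokogiri downcall is serialized behind a single mutex, so no two Ruby threads can be inside Nokogiri's C code (and therefore libxml2) at the same time. A minimal sketch of the idea, with hypothetical names (the real mechanism lives inside the Rubinius VM):

#include <pthread.h>

/* Hypothetical illustration of a per-extension C-API lock: every call into
 * the extension takes the same mutex, fully serializing the calls. */
static pthread_mutex_t nokogiri_capi_lock = PTHREAD_MUTEX_INITIALIZER;

typedef void *(*extension_fn)(void *arg);

static void *call_with_capi_lock(extension_fn fn, void *arg)
{
  void *result;
  pthread_mutex_lock(&nokogiri_capi_lock);
  result = fn(arg);                 /* the Nokogiri / libxml2 downcall */
  pthread_mutex_unlock(&nokogiri_capi_lock);
  return result;
}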

@headius (contributor) commented Feb 22, 2014

VMs like the JVM that have concurrently-executable garbage collectors generally do not allow those collectors to run concurrently with C calls that may be accessing managed objects. In fact, under most circumstances JVMs do not allow GC to run at all while unmanaged code is executing, unless you know the proper VM-specific incantations to allow that to happen.

If Rubinius is actually running GC concurrent to unmanaged code, this will not be the only issue to come up. Part of the implicit contract of MRI's C API is that unless GC is explicitly invoked, objects in hand will live until execution returns to the VM. The JVM's JNI API makes the same guarantee for the same reasons: you can't GC objects that native code might still be using.

I would recommend not allowing Rubinius's GC to run during the lifetime of a C extension downcall for all C extensions.

@brixen commented Feb 23, 2014

Rubinius was not collecting an object. It was marking it, per the mark function set on the Data object.

I'd argue it's a bug to wrap a structure that is being free'd independently of the free function available on the Data object.

@headius (contributor) commented Feb 23, 2014

@brixen Perhaps I'm confused about your GC... isn't marking part of the GC cycle? Or is there some other "marking" going on that's unrelated to GC?

@headius (contributor) commented Feb 23, 2014

Perhaps I can clarify things a bit. What I meant was that within a given C downcall, Rubinius should probably guarantee that no concurrent interaction of GC mechanisms (marking, sweeping, compacting) can occur while that C call is active.

Further, the MRI C API guarantees (implicitly, I'll grant) that VM-level mechanisms like GC will not interact with any objects (excessive, perhaps, but illustrative when you are attempting to emulate that API) while a C extension is executing. I think Rubinius will continue to run into problems with C extensions as long as it allows VM-level processes to run concurrently with C extension downcalls.

@headius (contributor) commented Sep 16, 2014

Was the Rubinius team able to sort out exactly what's causing this? I have heard vague claims that "libxml isn't thread-safe" or "nokogiri isn't thread-safe", but nobody seems to have recorded exactly what makes them thread-unsafe. That would be very useful information to give to either the Nokogiri team or the libxml maintainers.

If this is fixable in Nokogiri, it should be fixed. Need more information on what to fix.

If it's not fixable in Nokogiri but exposes a problem in libxml, the libxml maintainers should be notified and this bug should be closed (and libxml should be fixed, but that's somewhat out of our control).

If this is not fixable in Nokogiri or libxml, then it would indicate something's wrong in Rubinius, and this bug should be closed.

@headius (contributor) commented Sep 16, 2014

I would like some clarification about @brixen's "same result" comment above. Specifically:

A pointer is allocated
The pointer is wrapped by Nokogiri
The pointer is marked

What does this mean? A pointer is a memory address. Do you mean the object that wraps the pointer? And marked by whom?

The pointer is free'd, with every indication being it is free'd by libxml2

What are these indications? Do you have evidence to support this claim?

The Ruby object wrapping the pointer is still reachable, and an attempt to mark the pointer causes a SEGV

Again I'm unclear what you mean by "mark the pointer". If the pointer is allocated by libxml, then marking does not apply; it's a pointer to opaque memory libxml2 controls. I think you mean that the pointer gets freed independently of the DATA-related free function associated with the object? And you believe it's libxml doing it?

@headius (contributor) commented Sep 16, 2014

I've tried to consume as much of this bug as possible, and I'm still confused why you think that the problem is in nokogiri or libxml.

In #1047 (comment) @brixen said that adding a lock around nokogiri native calls and turning off the concurrent GC fixes the problem. That could indicate a concurrency issue in Nokogiri or libxml, true. But it could also point at Rubinius's concurrent GC or a problem with the test (especially if the threads running in Rubinius are not using isolated objects from Nokogiri).

What's the status? Do the changes that fixed the issue for @brixen fix it in general?

@flavorjones (member)

I've been running the nonogiri.rb script linked to above for the better part of the day on my 4-core laptop and haven't been able to reproduce it yet.

I've tried:

  • Rubinius 2.4.1 / Nokogiri 1.6.2.1 / libxml 2.8.0 with patches
  • Rubinius 2.4.1 / Nokogiri 1.6.5 / libxml 2.9.2 with patches

Any suggestions on how I can more reliably reproduce it? I've tried increasing and decreasing the number of threads; what else should I be trying?

Possibly-notably, I don't seem to be able to saturate my CPUs. Not having much experience using Rubinius, I wonder if I'm missing something obvious?

Any help or advice would be greatly appreciated. Thanks.

-m

@brixen commented Jan 10, 2015

@flavorjones I added a C-API lock specifically for Nokogiri, so you'll get no parallelization in Nokogiri code when running threads on Rubinius: https://github.com/rubinius/rubinius/blob/a7a6cb0052e3a3c19a3559f43cb1eb973251bfb5/vm/shared_state.cpp#L432

Perhaps @yorickpeterse has a better repro.

@yorickpeterse (author)

@brixen I tried running the script with and without the lock; neither crashed for me on Rbx master. I'll try with my actual apps in the coming week to see what happens there.

@brixen commented Jan 10, 2015

@yorickpeterse when you say, "with and without the lock", what do you mean? Nokogiri gets its own artisanal, free-range, environmentally friendly special lock now. How are you running, "without the lock"?

@yorickpeterse (author)

@brixen As in, I removed the CAPI_BLACK_LIST(nokogiri); bit, then recompiled Rubinius from source and ran the script.

@brixen commented Jan 10, 2015

@yorickpeterse ah, that is good to know!

@flavorjones (member)

@yorickpeterse, any luck seeing this happen in the wild in the last 10 days?

flavorjones self-assigned this on Jan 20, 2015
@yorickpeterse (author)

@flavorjones Sorry, I haven't gotten to it yet. We're right in the middle of a big database migration this week; I'll try to take a look this Friday or next week.

@flavorjones (member)

@yorickpeterse - just bumping to keep this on your radar. Would love more info if you have it.

@yorickpeterse (author)

I tried my app again and it did crash once, with a backtrace similar to the ones discussed above. Sadly I didn't set up crash logging properly, making copy-pasting the error a total nightmare. I'm currently running the application under GDB, but of course now it refuses to crash.

@flavorjones (member)

Running modern nokogiri under rbx-3, these crashes don't seem to occur (at least not in CI). I'd like to propose closing this, unless there's new and compelling information that I can use to investigate.

@flavorjones (member)

Closing. If anyone comes across this later: rbx3 builds with both system and vendored libraries are here and look green.

flavorjones added the topic/memory label on Feb 2, 2020