Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sharing the event loop leads to issues with processing foreign handles/cleanup #721

Closed
rdw-software opened this issue Oct 11, 2024 · 12 comments

Comments

@rdw-software
Copy link
Contributor

Following-up with #718 (comment)):

The SEGFAULTs mentioned here are related to a specific use case, which seemingly violates a core assumption in the luv code.

Handles ( uv_handle_t) managed by luv are allocated in luv_setup_handle, where they store the luv_handle_t as userdata:

data = (luv_handle_t*)malloc(sizeof(*data));

luv/src/lhandle.c

Lines 50 to 55 in 5eb6543

data->ref = luaL_ref(L, LUA_REGISTRYINDEX);
data->callbacks[0] = LUA_NOREF;
data->callbacks[1] = LUA_NOREF;
data->ctx = ctx;
data->extra = NULL;
data->extra_gc = NULL;

This works great when all handles are created in Lua and luv is the only user of the event loop. But when an application that embeds luv wants to integrate other libraries (that also support libuv) and then shares the uv_loop_t with luv, things can get a little dicey:

  • luv code was clearly written under the assumption that all handles have a luv_handle_t referenced by handle->data
  • This assumptions doesn't hold when the handle was created externally (data may be NULL or completely unrelated)
  • During uv.walk(), luv_walk_cb reads the data member to access the Lua registry based on the stored data->ref
  • On shutdown, loop_gc runs the loop until all handles are closed, again assuming they were created by luv (and need closing)
  • There are probably other scenarios where requests or handles might be created elsewhere and put on the same loop

The application might want to use libuv APIs directly (from C and/or the FFI), so a high degree of interoperability would be valuable. Beause luv_loop() is a public API I assumed that making use of the existing event loop was fine, even in the same Lua state. Some popular libraries that support sharing event loops would be libwebsockets and uSockets (used in uWebSockets). The latter is what I use.

I encountered the problem because the library creates several handles that trigger the assertion/crash when processed by luv:

luv/src/loop.c

Lines 92 to 98 in 5eb6543

static void luv_walk_cb(uv_handle_t* handle, void* arg) {
lua_State* L = (lua_State*)arg;
luv_handle_t* data = (luv_handle_t*)handle->data;
// Sanity check
// Most invalid values are large and refs are small, 0x1000000 is arbitrary.
assert(data && data->ref < 0x1000000);

Here the foreign handles were being processed as if their data pointed to a luv_handle_t reference, which it did not. I've since experimented with workarounds that allow skipping the irrelevant handles, but the lifetime/GC logic still choked on them.

I'm not sure if this is a supported use case or if you'd rather see embedders adopt a different approach. Potential solutions:

  • The application could create and manage multiple event loops, at the cost of increased complexity/unexpected behavior
  • The application might create multiple Lua states and retrieve a separate luv_loop for each, again increasing complexity
  • luv could tag the handles that it manages and then skip those created externally, at the cost of a larger footprint for all users
  • luv could provide hooks that optionally allow the application to handle problematic aspects itself, increasing complexity
  • luv could attempt to use the Lua registry instead of tagging handles, but this didn't seem to work reliably in my testing

There could be better ways of handling this. Or maybe the library already supports sharing the loop, but I didn't see how. Any ideas?

@zhaozg
Copy link
Member

zhaozg commented Oct 11, 2024

luv/src/luv.h

Lines 118 to 122 in 5eb6543

/* Set or clear an external uv_loop_t in a lua_State
When using a custom/external loop, this must be called before luaopen_luv
(otherwise luv will create and use its own loop)
*/
LUALIB_API void luv_set_loop(lua_State* L, uv_loop_t* loop);

We provide luv_set_loop for that, the neovim show howto embeds luv, with a external loop. https://github.com/neovim/neovim/blob/c4762b309714897615607f135aab9d7bcc763c4f/src/nvim/lua/executor.c#L600

more info please look at #331 #345

@rdw-software
Copy link
Contributor Author

Thanks! But the handles aren't treated any differently, nor does luv_set_callback help in this case. As soon as anything not originating with luv is put on the loop, processing it leads to crashes. Regardless of who owns/manages the uv_loop_t:

  1. Someone creates the loop itself (doesn't matter where this is done)
  2. Loop is assigned to luv and will be used to run all of its handles (as expected)
  3. Loop is assigned to uSockets, or some other third-party code for that matter
  4. This code puts additional handles onto the loop, which have no corresponding luv_handle_t
  5. uv.walk() invokes the luv_walk_cb for all handles, including the ones luv didn't create
  6. Crash or assertion failure ensues because the handles aren't recognized as valid by luv
  7. Bonus: They also cause problems when luv later tries to close them (GC/shutdown sequence)

I tried this (somewhat simplified) approach:

uv_loop_t shared_loop;
uv_loop_init(&shared_loop);
luv_set_loop(L, &shared_loop); // Stores the loop in the context (OK)
// Calling luaopen_luv afterwards does make it use the shared loop (OK)

// Assign loop to another library, create requests independently, or what have you
// uv.walk() from Lua, uv_walk from C, or do any other processing of the handles
// --> Will likely crash here unless luv ignores/is made aware of the foreign handles

uv_run(&shared_loop, UV_RUN_DEFAULT); // Doesn't matter where this happens, or if it happens at all (?)

NeoVIM seems to create its own main loop on top of uv_loop_t. I don't know that this approach would be desirable here.

@Bilal2453
Copy link
Contributor

NeoVIM seems to create its own main loop on top of uv_loop_t. I don't know that this approach would be desirable here.

I haven't looked at anything else yet, but yes indeed that is what Zhaozg is suggesting, drive luv's event loop using the other event loop, I don't believe integrating another event loop into Luv's, or letting it manage handles created by something else is supported nor am I sure if it is a goal. If I remember correctly even libuv has issues handling that last time I looked at integrating cqueues.

Luv still needs to know how to GC a handle (or if not to gc it), and will require other references while operating on a handle, I don't think it is really possible to change how all of this works as of now.

@zhaozg
Copy link
Member

zhaozg commented Oct 12, 2024

Crash or assertion failure ensues because the handles aren't recognized as valid by luv

Please know, use luv to process handler created by luv, if handler create by others lib, that will damage the luv_handle_t object(uv_handle_t->data) wrap mechanism. https://github.com/luvit/luv/blob/master/src/lhandle.c#L19.

@rdw-software
Copy link
Contributor Author

There's only one event loop (uv_loop_t), which is used by luv and potentially other code. That's the goal, at least.

I've looked through the luv sources some more and with luv_set_loop the GC problem is already solved. Indeed, the only function that should ever encounter the foreign handles is uv.walk() because it processes all the handles that exist on the shared loop (but crashes if any were created externally). That means there's no need for sweeping changes.

I've made it work by replacing uv.walk with a wrapper that "tags" the foreign handles (handle->data = NULL), calls uv_walk from luv, and finally restores the original pointer. That's the simplest and least hacky option I could come up with, but it does require one tiny change to luv itself. The assert in loop.c would have to be turned into a conditional return:

static void luv_walk_cb(uv_handle_t* handle, void* arg) {
  lua_State* L = (lua_State*)arg;
  luv_handle_t* data = (luv_handle_t*)handle->data;

  // Sanity check
  // Most invalid values are large and refs are small, 0x1000000 is arbitrary.
  if(!data || data->ref >= 0x1000000) return; // <--- This is the only change

  lua_pushvalue(L, 1);           // Copy the function
  luv_find_handle(L, data);      // Get the userdata
  data->ctx->cb_pcall(L, 1, 0, 0);  // Call the function
}

This effectively skips any handles that weren't created from within luv, and offloads everything else to the application.

A slightly less cryptic alternative would be to introduce a flag (tag value) that makes the intention clearer:

  // Some constant like this might be defined in luv.h - name and ptr value debatable:
  #define LUVF_EXTERNAL_HANDLE  ((void*)0xDEADBEEF) // Or another "magic" value

With this, an early exit could be used to skip invoking the callback for handles that luv doesn't manage itself:

static void luv_walk_cb(uv_handle_t* handle, void* arg) {
  lua_State* L = (lua_State*)arg;
  luv_handle_t* data = (luv_handle_t*)handle->data;
  
  if(data == LUVF_EXTERNAL_HANDLE) return; // Ignore external loop participants (no safety guarantees)
  
  // Sanity check
  // Most invalid values are large and refs are small, 0x1000000 is arbitrary.
  assert(data && data->ref < 0x1000000); // <--- Original assertion still exists

  lua_pushvalue(L, 1);           // Copy the function
  luv_find_handle(L, data);      // Get the userdata
  data->ctx->cb_pcall(L, 1, 0, 0);  // Call the function
}

Would you accept a non-disruptive change like that? I can diff-patch luv, but I'd rather PR it and avoid future headaches.

@Bilal2453
Copy link
Contributor

I feel like it makes sense for uv.walk to not just crash on you if it encounters a foreign handle, and rather just handle what it knows to handle. I think I prefer removing the assertion solution.

@squeek502
Copy link
Member

squeek502 commented Nov 5, 2024

This same assertion was tripped in a CI run of our tests here: https://github.com/luvit/luv/actions/runs/11649845591/job/32437850489

ok 126 thread - getaffinity, setaffinity
default priority	0
hello world from thread
priority in thread	0
lua: /home/runner/work/luv/luv/src/loop.c:98: luv_walk_cb: Assertion `data && data->ref < 0x1000000' failed.

It occurred during the "getpriority, setpriority" test, and a re-run fixed it so it is an intermittent failure.

Might ultimately be unrelated to what's discussed in this issue, but our own tests tripping this assertion seems like bad news.

@Bilal2453
Copy link
Contributor

Bilal2453 commented Nov 5, 2024

Memory corruption of some kind? Our Valgrind tests have also been failing for a while on some tests

@squeek502
Copy link
Member

Ah yeah, same failure here and (presumably, logs have been cleared) here

Not sure why it's only getting tripped when run under Valgrind. Will try to reproduce this locally when I get a chance.

@Bilal2453
Copy link
Contributor

Bilal2453 commented Nov 5, 2024

Okay to be honest this assertion may be useful for testing purposes while not perfect for production use. Instead of removing it entirely we should move it into the tests, somehow.

@squeek502
Copy link
Member

squeek502 commented Nov 5, 2024

Well, there are two separate issues I think:

  1. As you said, this assertion is useful for our own testing. Tripping it means something is going wrong.
  2. Tripping this assertion is unavoidable for some use cases that we may want to support

How we resolve both should be considered together. As I said in #727 (comment), the changes in #723 could turn the intermittent failure into a silent failure.

Anyway, I've split the intermittent failure into its own issue here: #730

@rdw-software
Copy link
Contributor Author

Resolved: Foreign handles should be ignored as of #733. GC problems can be avoided by using luv_set_loop.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants