[C++] Add support for disabling threading for emscripten #35176

Closed
joemarshall opened this issue Apr 17, 2023 · 6 comments · Fixed by #35672

@joemarshall
Contributor

Describe the enhancement requested

I've built most of arrow (pyarrow and dependencies) for emscripten. It would be good to have a way to disable threading, as a lot of emscripten use is in browsers where threading may not be available.

At the moment I just put in dummy pthreads, which means some functionality (e.g. in datasets) fails because it assumes threading is available.

Component(s)

C++

@assignUser assignUser changed the title Arrow_threading [C++] Arrow_threading Apr 17, 2023
@westonpace
Member

I'd say this is non-trivial but should be doable. Most non-I/O components generally have a way to disable threading. There are probably some exceptions however and we could clean those up.

I/O, on the other hand, tends to rely on the I/O thread pool being available. What does I/O look like in emscripten? I'm thinking of both "local disk I/O" (e.g. reading a parquet file from local disk) and network I/O (e.g. reading a parquet file from S3).

Does emscripten have APIs for these things? Do they have async variants?

@joemarshall
Contributor Author

On emscripten in the browser, the local disk is memory-based and may or may not be synced to some kind of permanent storage (via an asynchronous syncfs call). Access to this disk is synchronous, but very quick because it is in memory. In node, you can use the real file system directly.

Network is weird, because emscripten code is typically hosted in browsers. For http / https one can call out to JavaScript to use the Fetch API, which is asynchronous. Right now there's only async I/O for network, with the exception of XMLHttpRequest if you're in a web worker, which is a hacky workaround for synchronous http access. In theory there's also a websockets wrapper which turns socket calls in C into websocket calls to the hosting server, but I don't know how well it works.

Basically, as I understand it, the potential in emscripten for arrow is:

  1. Local file system stuff should just work, if it can be read without threads (I had code reading a parquet file which worked okay)

  2. Network things (e.g. reading from S3) would probably need porting work so that they run over http or websockets. Anything with a REST API or websockets API should be fine. Things that require direct connections or making servers won't work.

  3. I think this means that flight is going to be quite limited in its usefulness in webassembly, so I haven't even thought about compiling that.

Personally, for what I want, I just want core arrow with file support to work on emscripten - I think that is a decent starting point before getting into complexities.

@joemarshall
Contributor Author

One thought, but if ThreadPool was to wrap a singleton SerialExecutor if ARROW_DISABLE_THREADING was set, things might just work well enough for a first attempt at an unthreaded build? I don't know whether there'd be any deadlocks in i/o vs compute tasks though?

@westonpace
Member

> One thought, but if ThreadPool was to wrap a singleton SerialExecutor if ARROW_DISABLE_THREADING was set, things might just work well enough for a first attempt at an unthreaded build? I don't know whether there'd be any deadlocks in i/o vs compute tasks though?

If everything is tasks and a single SerialExecutor is used then it should be ok (e.g. there shouldn't be deadlocks). It will just be very slow when doing I/O because we will be sitting there waiting on I/O with our one thread while the CPU sits completely idle.

That being said, I'm sure there are a few bugs / things that will need to be converted.

However, I'm not sure we can just wrap the global thread pools with a serial executor. The challenge with the serial executor is that it has to co-opt the calling thread. This means we create the executor when the call starts.

void ReadTable() {
  // DoReadTable is a function returning a Future.
  // Note: we are not calling it here, just passing it as a parameter.
  RunInSerialExecutor(DoReadTable);
}
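To make "co-opting the calling thread" concrete, here is a minimal, self-contained sketch. Everything in it (`ToySerialExecutor`, `ToyFuture`, `RunInToySerialExecutor`, and this version of `DoReadTable`) is a toy stand-in invented for illustration, not Arrow's actual `SerialExecutor` or `Future`. It only shows the shape of the idea: the executor is created when the call starts, and the caller's own thread drains the queue until the future completes.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <memory>
#include <optional>
#include <utility>

// Toy single-threaded task queue (assumption: not Arrow's SerialExecutor).
class ToySerialExecutor {
 public:
  void Spawn(std::function<void()> task) { tasks_.push_back(std::move(task)); }

  // Run queued tasks on the calling thread until `done()` is true
  // or there is nothing left to run.
  template <typename Pred>
  void RunUntil(Pred done) {
    while (!done() && !tasks_.empty()) {
      auto task = std::move(tasks_.front());
      tasks_.pop_front();
      task();
    }
  }

 private:
  std::deque<std::function<void()>> tasks_;
};

// Toy future: shared state that some task fills in later.
template <typename T>
struct ToyFuture {
  std::shared_ptr<std::optional<T>> state = std::make_shared<std::optional<T>>();
  bool is_finished() const { return state->has_value(); }
  void MarkFinished(T value) const { *state = std::move(value); }
  const T& result() const { return **state; }
};

// The executor is created here, when the call starts, and the calling
// thread is used to run tasks until the future is finished.
template <typename T>
T RunInToySerialExecutor(
    std::function<ToyFuture<T>(ToySerialExecutor*)> get_future) {
  ToySerialExecutor executor;
  ToyFuture<T> fut = get_future(&executor);
  executor.RunUntil([&] { return fut.is_finished(); });
  assert(fut.is_finished() && "no task ever completed the future");
  return fut.result();
}

// Hypothetical async read: a "read" task that schedules a "decode" task.
ToyFuture<int> DoReadTable(ToySerialExecutor* executor) {
  ToyFuture<int> fut;
  executor->Spawn([executor, fut] {
    int rows = 40;  // pretend this came from a file
    executor->Spawn([fut, rows] { fut.MarkFinished(rows + 2); });
  });
  return fut;
}

// Passes DoReadTable as a parameter rather than calling it directly.
int ReadTable() { return RunInToySerialExecutor<int>(DoReadTable); }
```

In this sketch `ReadTable()` blocks the caller until both chained tasks have run, which is why a global pool cannot simply be swapped for an executor of this kind: there is no caller to co-opt until someone actually waits.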

However, if we are combining I/O and CPU into a single pool, then it should be possible to create a special executor that lazily creates a serial executor when a task is first submitted.

@westonpace
Member

Something like...

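// Pseudocode: instance_ is a nullable handle to the currently running executor.
void AddTask(Task t) {
  if (instance_) {
    // A serial executor is already running further up the stack; queue onto it.
    instance_->AddTask(t);
  } else {
    // First task: create a serial executor that co-opts the calling thread
    // and runs until t (and anything it spawns) has finished.
    instance_ = SerialExecutor(t);
    instance_ = nullptr;
  }
}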
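A runnable approximation of the pseudocode above might look like the following. This is only a sketch under stated assumptions: `LazySerialPool` and `RunLazyPoolDemo` are hypothetical names, a `draining_` flag plays the role of the `instance_` check, and nothing here reflects Arrow's real `ThreadPool` interface. The first `AddTask` call takes over the caller's thread; nested submissions are queued and picked up by the outer loop instead of being run inline.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical single-threaded "pool" (not Arrow's ThreadPool API).
class LazySerialPool {
 public:
  using Task = std::function<void()>;

  void AddTask(Task task) {
    queue_.push_back(std::move(task));
    if (draining_) {
      // We were called from inside a running task: the outer loop below
      // will pick this task up, so just return.
      return;
    }
    // First submission: co-opt the calling thread until the queue is empty.
    // (A real implementation would also need to reset the flag on exceptions.)
    draining_ = true;
    while (!queue_.empty()) {
      Task next = std::move(queue_.front());
      queue_.pop_front();
      next();
    }
    draining_ = false;
  }

 private:
  std::deque<Task> queue_;
  bool draining_ = false;
};

// Records the order in which the pieces of work actually run.
std::vector<int> RunLazyPoolDemo() {
  LazySerialPool pool;
  std::vector<int> order;
  pool.AddTask([&] {
    order.push_back(1);
    pool.AddTask([&] { order.push_back(3); });  // queued, not run inline
    order.push_back(2);
  });
  return order;
}
```

Queuing nested tasks rather than running them recursively keeps stack depth bounded and preserves submission order, at the cost of the first caller not returning until all transitively spawned work is done.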

@joemarshall
Contributor Author

I did some work on this - it's currently functional for quite a lot of things but failing some tests. I'm working through them.

It has to keep the concept of multiple executors because loads of the other logic relies on that, but all active tasks from any executors are dispatched in turn whenever anything waits for any task or future.
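One way to picture that arrangement is a registry that knows about every executor and takes turns running one task from each whenever something waits. The sketch below is purely illustrative and is not the code from the patch: `ToyExecutor`, `TaskRegistry`, `WaitFor` and the demo function are hypothetical names.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <string>
#include <utility>
#include <vector>

class TaskRegistry;

// Hypothetical executor that only queues work; it never owns a thread.
class ToyExecutor {
 public:
  explicit ToyExecutor(TaskRegistry* registry);
  void Spawn(std::function<void()> task) { tasks_.push_back(std::move(task)); }

  // Runs at most one queued task; returns false if nothing was pending.
  bool RunOne() {
    if (tasks_.empty()) return false;
    auto task = std::move(tasks_.front());
    tasks_.pop_front();
    task();
    return true;
  }

 private:
  std::deque<std::function<void()>> tasks_;
};

// Hypothetical registry of all live executors.
class TaskRegistry {
 public:
  void Register(ToyExecutor* executor) { executors_.push_back(executor); }

  // Called by whoever is waiting: dispatch tasks from every executor in turn
  // until `done()` holds. Returns false if no executor can make progress.
  template <typename Pred>
  bool WaitFor(Pred done) {
    while (!done()) {
      bool progressed = false;
      for (ToyExecutor* executor : executors_) {
        if (executor->RunOne()) progressed = true;
      }
      if (!progressed) return false;  // nothing left that could finish the wait
    }
    return true;
  }

 private:
  std::vector<ToyExecutor*> executors_;
};

inline ToyExecutor::ToyExecutor(TaskRegistry* registry) {
  registry->Register(this);
}

// An "I/O" executor and a "CPU" executor sharing the one available thread.
std::vector<std::string> RunTwoExecutorDemo() {
  TaskRegistry registry;
  ToyExecutor io(&registry), cpu(&registry);
  std::vector<std::string> log;
  bool finished = false;

  io.Spawn([&] {
    log.push_back("io:read");
    cpu.Spawn([&] {
      log.push_back("cpu:decode");
      finished = true;
    });
  });
  cpu.Spawn([&] { log.push_back("cpu:other"); });

  bool ok = registry.WaitFor([&] { return finished; });
  assert(ok);
  (void)ok;
  return log;
}
```

Because the wait loop services every executor, a CPU task that depends on an I/O task (or the reverse) still makes progress even though neither executor has a thread of its own.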

@kou kou changed the title [C++] Arrow_threading [C++] Add support for building with emscripten Aug 9, 2023
@kou kou changed the title [C++] Add support for building with emscripten [C++] Add support for disabling threading Aug 9, 2023
@kou kou changed the title [C++] Add support for disabling threading [C++] Add support for disabling threading for emscripten Aug 9, 2023
@kou kou closed this as completed in #35672 Aug 9, 2023
kou added a commit that referenced this issue Aug 9, 2023
…35672)

As previously discussed in #35176 this is a patch that adds an option `ARROW_ENABLE_THREADING`. When it is turned off, arrow threadpool and serial executors don't spawn threads, and instead run tasks in the main thread when futures are waited for.
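The behaviour described above (no threads spawned, tasks run when something waits) can be sketched with a compile-time switch. In this sketch `TOY_ENABLE_THREADING` is a made-up macro standing in for whatever the build option controls, and `ToyPool` and `SumSquares` are hypothetical; none of this is taken from the patch itself.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>
#ifdef TOY_ENABLE_THREADING
#include <thread>
#endif

// Hypothetical pool whose behaviour is chosen at compile time.
// TOY_ENABLE_THREADING is an illustrative macro, not an Arrow definition.
class ToyPool {
 public:
  void Submit(std::function<void()> task) {
#ifdef TOY_ENABLE_THREADING
    workers_.emplace_back(std::move(task));  // one thread per task, for brevity
#else
    pending_.push_back(std::move(task));  // deferred until someone waits
#endif
  }

  void WaitAll() {
#ifdef TOY_ENABLE_THREADING
    for (auto& worker : workers_) worker.join();
    workers_.clear();
#else
    // Run everything on the waiting thread. Index-based loop so that tasks
    // which submit further tasks are also picked up.
    for (std::size_t i = 0; i < pending_.size(); ++i) {
      auto task = std::move(pending_[i]);
      task();
    }
    pending_.clear();
#endif
  }

 private:
#ifdef TOY_ENABLE_THREADING
  std::vector<std::thread> workers_;
#else
  std::vector<std::function<void()>> pending_;
#endif
};

// Each task writes to its own slot, so the result is the same in both modes.
int SumSquares(int n) {
  ToyPool pool;
  std::vector<int> out(static_cast<std::size_t>(n), 0);
  for (int i = 0; i < n; ++i) {
    pool.Submit([&out, i] { out[static_cast<std::size_t>(i)] = i * i; });
  }
  pool.WaitAll();
  int total = 0;
  for (int v : out) total += v;
  return total;
}
```

Callers see the same `Submit` / `WaitAll` interface either way, which is the property that lets most of a code base keep working when threading is switched off, just more slowly.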

It doesn't mess with threading in projects included as dependencies, e.g. multithreaded malloc implementations because if you're building for a non threaded environment, you can't use those anyway.

Basically where this is at is that it runs the test suite okay, and I think should work well enough to be a backend for pandas on emscripten/pyodide.

What this means is:
1) It is possible to use arrow in non-threaded emscripten/webassembly environments (with some build patches specific to emscripten which I'll put in once this is in)
2) Most of arrow just works, albeit slower in parts.

Things that don't work and probably won't:
1) Server stuff that relies on threads. Not a massive problem I think because environments with threading restrictions are currently typically also restricted from making servers anyway (i.e. they are web browsers)
2) Anything that relies on actually doing two things at once (for obvious reasons)

Things that don't work yet and could be fixed in future:
1) use of asynchronous file/network APIs in emscripten which would mean I/O could work efficiently in one thread.
2) asofjoin - right now the implementation relies on std::thread - it needs refactoring to work with threadpool like everything else in arrow, but I'm not sure I am expert enough in the codebase to do it well.
* Closes: #35176

Lead-authored-by: Joe Marshall <joe.marshall@nottingham.ac.uk>
Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
@kou kou added this to the 14.0.0 milestone Aug 9, 2023
loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue Nov 13, 2023
…ten (apache#35672)
