Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance with Async #7

Open
jawline opened this issue Jul 5, 2021 · 2 comments
Open

Performance with Async #7

jawline opened this issue Jul 5, 2021 · 2 comments

Comments

@jawline
Copy link
Contributor

jawline commented Jul 5, 2021

@mt-caret

Adding this here as a central point for discussion rather than X-linking between different repositories.

When you have some time it would be great if you could validate some experimental results and check the examples in the prototype-async backend. In particular I'm seeing higher system usage in read-heavy workloads with Async. I can't figure out if this is an increase in load due to buffering or issues driven by file-IO threading.

If it is a true gain then the backend might have a notable impact on throughput in IO bound applications with streamed computation, since CPU's tend to be tuned to boost workloads on a single core.

With IO uring:

async_io_uring> time dune exec examples/demo_uring.exe -- -path ~/Downloads/test_file -outpath ~/Downloads/test_file2 -nowrite
open parameters: 0 0 
Starting up
Determined!
Done
dune exec examples/demo_uring.exe -- -path ~/Downloads/test_file -outpath    10.15s user 0.98s system 58% cpu 18.968 total

With Async:

async_io_uring> time dune exec examples/demo_std.exe -- -path ~/Downloads/test_file -outpath ~/Downloads/test_file2 -nowrite
Determined!          
Done
dune exec examples/demo_std.exe -- -path ~/Downloads/test_file -outpath    17.68s user 12.96s system 112% cpu 27.258 total

Tested on a 10GB file with hot caches.

In particular, I'm using the bytes_ variants of methods with async, I don't think the increased memcpy should CPU usage all that much though. Perf top shows an increase in kernel locking, but it's too illuminating.

Kernel traces (A bit misleading since uring will have the read path obscured by the async kernel, wheras in the syscall path the read path is visible in the trace):

Async Standard

Uring

Would be great to get your thoughts and make sure this isn't just PEBKAC.

@mt-caret
Copy link
Owner

mt-caret commented Jul 8, 2021

I've taken a (very quick) look at the code and have a few suggestions:

  • compiling with flambda:
$ opam switch
#  switch          compiler                       description
→  4.11.1+flambda  ocaml-variants.4.11.1+flambda  4.11.1+flambda
   default         ocaml-base-compiler.4.12.0     default
  • enabling --profile=release
  • using Bytes.unsafe_get

With these, I get numbers like this:

$ /usr/bin/time -v dune exec --profile=release examples/demo_uring.exe -- -path ~/test_file_10g -outpath ~/test_file_1g_out -nowrite
Starting up
Determined!
(crc 1369039105800)
        Command being timed: "dune exec --profile=release examples/demo_uring.exe -- -path /home/ubuntu/test_file_10g -outpath /home/ubuntu/test_file_1g_out -nowrite"
        User time (seconds): 6.08
        System time (seconds): 2.09
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:08.18
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 27616
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 9435
        Voluntary context switches: 23
        Involuntary context switches: 104
        Swaps: 0
        File system inputs: 0
        File system outputs: 760
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
$ /usr/bin/time -v dune exec --profile=release examples/demo_std.exe -- -path ~/test_file_10g -outpath ~/test_file_1g_out -nowrite
Determined!
(crc 1369039105800)
        Command being timed: "dune exec --profile=release examples/demo_std.exe -- -path /home/ubuntu/test_file_10g -outpath /home/ubuntu/test_file_1g_out -nowrite"
        User time (seconds): 12.20
        System time (seconds): 6.02
        Percent of CPU this job got: 120%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:15.17
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 27568
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 9358
        Voluntary context switches: 871398
        Involuntary context switches: 70
        Swaps: 0
        File system inputs: 0
        File system outputs: 760
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

It seems like io_uring massively reduces voluntary context switches, as expected 😁

@jawline
Copy link
Contributor Author

jawline commented Jul 8, 2021

Oh wow, I forgot to enable release optimizations, obviously in a CPU bound workload these would have a significant impact.

When you have a chance could you try run the tcp examples and pipe a file into them via netcat. They essentially do the same thing but via TCP, locally I can't validate any difference in speed between the two. If you get the same result then I think it makes sense to try just retrofitting file IO using uring onto the existing loop rather than a whole hog async event loop rewrite.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants