Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Benchmark results for storage to SSD with/without liburing #1123

Open
boxerab opened this issue Apr 11, 2024 · 15 comments
Open

Benchmark results for storage to SSD with/without liburing #1123

boxerab opened this issue Apr 11, 2024 · 15 comments

Comments

@boxerab
Copy link

boxerab commented Apr 11, 2024

Hello, here are some benchmark results I have compiled for disk storage with/without liburing testing both buffered and direct IO.

Kernel: 6.7
file system: xfs
disk : nvme ssd
CPU: 48 thread AMD

The benchmark essentially takes a series of small buffers, does some work on each buffer, then stores the results to disk. fsync is called at the end. Unfortunately, liburing is slightly slower than blocking I/O

The benchmark project itself can be found here.

Note : 0 or 1 below represents false or true

Run with concurrency = 2, store to disk = 0, direct = 0, use uring = 0
 : 6213.034517 ms
Run with concurrency = 2, store to disk = 1, direct = 0, use uring = 0
 : 6194.979951 ms
Run with concurrency = 2, store to disk = 1, direct = 0, use uring = 1
 : 6635.473980 ms
Run with concurrency = 2, store to disk = 1, direct = 1, use uring = 0
 : 5723.335242 ms
Run with concurrency = 2, store to disk = 1, direct = 1, use uring = 1
 : 5159.495408 ms
\\\\\\\\\\\\\\\\\\\\\\\\\\\
Run with concurrency = 4, store to disk = 0, direct = 0, use uring = 0
 : 2668.155356 ms
Run with concurrency = 4, store to disk = 1, direct = 0, use uring = 0
 : 3611.529825 ms
Run with concurrency = 4, store to disk = 1, direct = 0, use uring = 1
 : 3696.223504 ms
Run with concurrency = 4, store to disk = 1, direct = 1, use uring = 0
 : 2901.635248 ms
Run with concurrency = 4, store to disk = 1, direct = 1, use uring = 1
 : 2663.072939 ms
\\\\\\\\\\\\\\\\\\\\\\\\\\\
Run with concurrency = 6, store to disk = 0, direct = 0, use uring = 0
 : 1723.227584 ms
Run with concurrency = 6, store to disk = 1, direct = 0, use uring = 0
 : 2666.758831 ms
Run with concurrency = 6, store to disk = 1, direct = 0, use uring = 1
 : 2632.156388 ms
Run with concurrency = 6, store to disk = 1, direct = 1, use uring = 0
 : 1981.299622 ms
Run with concurrency = 6, store to disk = 1, direct = 1, use uring = 1
 : 1807.754762 ms
\\\\\\\\\\\\\\\\\\\\\\\\\\\
Run with concurrency = 8, store to disk = 0, direct = 0, use uring = 0
 : 1330.179655 ms
Run with concurrency = 8, store to disk = 1, direct = 0, use uring = 0
 : 2095.117032 ms
Run with concurrency = 8, store to disk = 1, direct = 0, use uring = 1
 : 2122.957382 ms
Run with concurrency = 8, store to disk = 1, direct = 1, use uring = 0
 : 1523.161476 ms
Run with concurrency = 8, store to disk = 1, direct = 1, use uring = 1
 : 1367.067723 ms
\\\\\\\\\\\\\\\\\\\\\\\\\\\
Run with concurrency = 10, store to disk = 0, direct = 0, use uring = 0
 : 1061.972147 ms
Run with concurrency = 10, store to disk = 1, direct = 0, use uring = 0
 : 1899.936051 ms
Run with concurrency = 10, store to disk = 1, direct = 0, use uring = 1
 : 1877.989199 ms
Run with concurrency = 10, store to disk = 1, direct = 1, use uring = 0
 : 1244.832314 ms
Run with concurrency = 10, store to disk = 1, direct = 1, use uring = 1
 : 1099.402518 ms
\\\\\\\\\\\\\\\\\\\\\\\\\\\
Run with concurrency = 12, store to disk = 0, direct = 0, use uring = 0
 : 874.523286 ms
Run with concurrency = 12, store to disk = 1, direct = 0, use uring = 0
 : 1701.024333 ms
Run with concurrency = 12, store to disk = 1, direct = 0, use uring = 1
 : 1649.511243 ms
Run with concurrency = 12, store to disk = 1, direct = 1, use uring = 0
 : 1064.506254 ms
Run with concurrency = 12, store to disk = 1, direct = 1, use uring = 1
 : 932.916314 ms
\\\\\\\\\\\\\\\\\\\\\\\\\\\
Run with concurrency = 14, store to disk = 0, direct = 0, use uring = 0
 : 778.916140 ms
Run with concurrency = 14, store to disk = 1, direct = 0, use uring = 0
 : 1538.574320 ms
Run with concurrency = 14, store to disk = 1, direct = 0, use uring = 1
 : 1497.650033 ms
Run with concurrency = 14, store to disk = 1, direct = 1, use uring = 0
 : 946.505283 ms
Run with concurrency = 14, store to disk = 1, direct = 1, use uring = 1
 : 872.751521 ms
\\\\\\\\\\\\\\\\\\\\\\\\\\\
Run with concurrency = 16, store to disk = 0, direct = 0, use uring = 0
 : 669.623079 ms
Run with concurrency = 16, store to disk = 1, direct = 0, use uring = 0
 : 1424.179659 ms
Run with concurrency = 16, store to disk = 1, direct = 0, use uring = 1
 : 1458.929794 ms
Run with concurrency = 16, store to disk = 1, direct = 1, use uring = 0
 : 879.752639 ms
Run with concurrency = 16, store to disk = 1, direct = 1, use uring = 1
 : 987.317027 ms
\\\\\\\\\\\\\\\\\\\\\\\\\\\
Run with concurrency = 18, store to disk = 0, direct = 0, use uring = 0
 : 611.685649 ms
Run with concurrency = 18, store to disk = 1, direct = 0, use uring = 0
 : 2418.678168 ms
Run with concurrency = 18, store to disk = 1, direct = 0, use uring = 1
 : 1459.609819 ms
Run with concurrency = 18, store to disk = 1, direct = 1, use uring = 0
 : 865.681722 ms
Run with concurrency = 18, store to disk = 1, direct = 1, use uring = 1
 : 884.537595 ms
\\\\\\\\\\\\\\\\\\\\\\\\\\\
Run with concurrency = 20, store to disk = 0, direct = 0, use uring = 0
 : 572.201385 ms
Run with concurrency = 20, store to disk = 1, direct = 0, use uring = 0
 : 2631.481062 ms
Run with concurrency = 20, store to disk = 1, direct = 0, use uring = 1
 : 1380.879326 ms
Run with concurrency = 20, store to disk = 1, direct = 1, use uring = 0
 : 875.950846 ms
Run with concurrency = 20, store to disk = 1, direct = 1, use uring = 1
 : 872.418497 ms
\\\\\\\\\\\\\\\\\\\\\\\\\\\
Run with concurrency = 22, store to disk = 0, direct = 0, use uring = 0
 : 518.975461 ms
Run with concurrency = 22, store to disk = 1, direct = 0, use uring = 0
 : 1281.144535 ms
Run with concurrency = 22, store to disk = 1, direct = 0, use uring = 1
 : 1341.479703 ms
Run with concurrency = 22, store to disk = 1, direct = 1, use uring = 0
 : 861.209592 ms
Run with concurrency = 22, store to disk = 1, direct = 1, use uring = 1
 : 890.015725 ms
\\\\\\\\\\\\\\\\\\\\\\\\\\\
Run with concurrency = 24, store to disk = 0, direct = 0, use uring = 0
 : 486.925149 ms
Run with concurrency = 24, store to disk = 1, direct = 0, use uring = 0
 : 1262.528753 ms
Run with concurrency = 24, store to disk = 1, direct = 0, use uring = 1
 : 1267.014048 ms
Run with concurrency = 24, store to disk = 1, direct = 1, use uring = 0
 : 873.210672 ms
Run with concurrency = 24, store to disk = 1, direct = 1, use uring = 1
 : 908.068466 ms
\\\\\\\\\\\\\\\\\\\\\\\\\\\
Run with concurrency = 26, store to disk = 0, direct = 0, use uring = 0
 : 466.603771 ms
Run with concurrency = 26, store to disk = 1, direct = 0, use uring = 0
 : 1252.972816 ms
Run with concurrency = 26, store to disk = 1, direct = 0, use uring = 1
 : 1282.536363 ms
Run with concurrency = 26, store to disk = 1, direct = 1, use uring = 0
 : 858.094422 ms
Run with concurrency = 26, store to disk = 1, direct = 1, use uring = 1
 : 889.864092 ms
\\\\\\\\\\\\\\\\\\\\\\\\\\\
Run with concurrency = 28, store to disk = 0, direct = 0, use uring = 0
 : 454.630329 ms
Run with concurrency = 28, store to disk = 1, direct = 0, use uring = 0
 : 1201.286920 ms
Run with concurrency = 28, store to disk = 1, direct = 0, use uring = 1
 : 1232.661758 ms
Run with concurrency = 28, store to disk = 1, direct = 1, use uring = 0
 : 875.786470 ms
Run with concurrency = 28, store to disk = 1, direct = 1, use uring = 1
 : 922.097025 ms
\\\\\\\\\\\\\\\\\\\\\\\\\\\
Run with concurrency = 30, store to disk = 0, direct = 0, use uring = 0
 : 495.070279 ms
Run with concurrency = 30, store to disk = 1, direct = 0, use uring = 0
 : 1206.026119 ms
Run with concurrency = 30, store to disk = 1, direct = 0, use uring = 1
 : 1223.569305 ms
Run with concurrency = 30, store to disk = 1, direct = 1, use uring = 0
 : 874.067966 ms
Run with concurrency = 30, store to disk = 1, direct = 1, use uring = 1
 : 929.383958 ms
\\\\\\\\\\\\\\\\\\\\\\\\\\\
Run with concurrency = 32, store to disk = 0, direct = 0, use uring = 0
 : 435.855607 ms
Run with concurrency = 32, store to disk = 1, direct = 0, use uring = 0
 : 1216.080516 ms
Run with concurrency = 32, store to disk = 1, direct = 0, use uring = 1
 : 1223.707928 ms
Run with concurrency = 32, store to disk = 1, direct = 1, use uring = 0
 : 871.715323 ms
Run with concurrency = 32, store to disk = 1, direct = 1, use uring = 1
 : 936.795530 ms
\\\\\\\\\\\\\\\\\\\\\\\\\\\
Run with concurrency = 34, store to disk = 0, direct = 0, use uring = 0
 : 427.075910 ms
Run with concurrency = 34, store to disk = 1, direct = 0, use uring = 0
 : 1216.118478 ms
Run with concurrency = 34, store to disk = 1, direct = 0, use uring = 1
 : 1227.662183 ms
Run with concurrency = 34, store to disk = 1, direct = 1, use uring = 0
 : 870.027497 ms
Run with concurrency = 34, store to disk = 1, direct = 1, use uring = 1
 : 947.928060 ms
\\\\\\\\\\\\\\\\\\\\\\\\\\\
Run with concurrency = 36, store to disk = 0, direct = 0, use uring = 0
 : 436.021254 ms
Run with concurrency = 36, store to disk = 1, direct = 0, use uring = 0
 : 1222.277052 ms
Run with concurrency = 36, store to disk = 1, direct = 0, use uring = 1
 : 1226.627501 ms
Run with concurrency = 36, store to disk = 1, direct = 1, use uring = 0
 : 874.256454 ms
Run with concurrency = 36, store to disk = 1, direct = 1, use uring = 1
 : 949.608522 ms
\\\\\\\\\\\\\\\\\\\\\\\\\\\
Run with concurrency = 38, store to disk = 0, direct = 0, use uring = 0
 : 426.193849 ms
Run with concurrency = 38, store to disk = 1, direct = 0, use uring = 0
 : 1133.525446 ms
Run with concurrency = 38, store to disk = 1, direct = 0, use uring = 1
 : 1247.804088 ms
Run with concurrency = 38, store to disk = 1, direct = 1, use uring = 0
 : 878.324956 ms
Run with concurrency = 38, store to disk = 1, direct = 1, use uring = 1
 : 937.863404 ms
\\\\\\\\\\\\\\\\\\\\\\\\\\\
Run with concurrency = 40, store to disk = 0, direct = 0, use uring = 0
 : 431.698457 ms
Run with concurrency = 40, store to disk = 1, direct = 0, use uring = 0
 : 1131.224651 ms
Run with concurrency = 40, store to disk = 1, direct = 0, use uring = 1
 : 1246.924182 ms
Run with concurrency = 40, store to disk = 1, direct = 1, use uring = 0
 : 874.366975 ms
Run with concurrency = 40, store to disk = 1, direct = 1, use uring = 1
 : 947.108277 ms
\\\\\\\\\\\\\\\\\\\\\\\\\\\
Run with concurrency = 42, store to disk = 0, direct = 0, use uring = 0
 : 422.130728 ms
Run with concurrency = 42, store to disk = 1, direct = 0, use uring = 0
 : 1130.728607 ms
Run with concurrency = 42, store to disk = 1, direct = 0, use uring = 1
 : 1243.905110 ms
Run with concurrency = 42, store to disk = 1, direct = 1, use uring = 0
 : 897.637893 ms
Run with concurrency = 42, store to disk = 1, direct = 1, use uring = 1
 : 914.732763 ms
\\\\\\\\\\\\\\\\\\\\\\\\\\\
Run with concurrency = 44, store to disk = 0, direct = 0, use uring = 0
 : 420.690524 ms
Run with concurrency = 44, store to disk = 1, direct = 0, use uring = 0
 : 1275.922281 ms
Run with concurrency = 44, store to disk = 1, direct = 0, use uring = 1
 : 1285.821973 ms
Run with concurrency = 44, store to disk = 1, direct = 1, use uring = 0
 : 885.540083 ms
Run with concurrency = 44, store to disk = 1, direct = 1, use uring = 1
 : 997.991546 ms
\\\\\\\\\\\\\\\\\\\\\\\\\\\
Run with concurrency = 46, store to disk = 0, direct = 0, use uring = 0
 : 419.765191 ms
Run with concurrency = 46, store to disk = 1, direct = 0, use uring = 0
 : 1275.787164 ms
Run with concurrency = 46, store to disk = 1, direct = 0, use uring = 1
 : 1303.099823 ms
Run with concurrency = 46, store to disk = 1, direct = 1, use uring = 0
 : 891.274178 ms
Run with concurrency = 46, store to disk = 1, direct = 1, use uring = 1
 : 956.722336 ms
\\\\\\\\\\\\\\\\\\\\\\\\\\\
Run with concurrency = 48, store to disk = 0, direct = 0, use uring = 0
 : 417.375889 ms
Run with concurrency = 48, store to disk = 1, direct = 0, use uring = 0
 : 1305.577293 ms
Run with concurrency = 48, store to disk = 1, direct = 0, use uring = 1
 : 1437.136640 ms
Run with concurrency = 48, store to disk = 1, direct = 1, use uring = 0
 : 896.589395 ms
Run with concurrency = 48, store to disk = 1, direct = 1, use uring = 1
 : 999.392985 ms
\\\\\\\\\\\\\\\\\\\\\\\\\\\
@boxerab boxerab closed this as completed Apr 11, 2024
@boxerab boxerab reopened this Apr 11, 2024
@isilence
Copy link
Collaborator

What is concurrency for non io_uring? Are you comparing 1 thread for io_uring vs multiple threads for the other one?

@boxerab
Copy link
Author

boxerab commented Apr 11, 2024

What is concurrency for non io_uring? Are you comparing 1 thread for io_uring vs multiple threads for the other one?

0 or 1 represents false or true.

For example

Run with concurrency = 48, store to disk = 1, direct = 1, use uring = 1
 : 999.392985 ms

represents 48 threads each working on small buffers and then queuing the buffers for storage with uring using O_DIRECT

@isilence
Copy link
Collaborator

So, let me confirm, "concurrency = 2, use uring = 1" means there are 2 threads, each thread has an io_uring instance, and each thread keep one request inflight, i.e. QD=1? Similar to this:

void thread_fn(ring) {
	while (1) {
		sqe = get_sqe();
		prep_write(sqe);
		io_uring_submit(ring);

		cqe = io_uring_wait();
		handle(cqe);
	}
}

@boxerab
Copy link
Author

boxerab commented Apr 12, 2024

So, let me confirm, "concurrency = 2, use uring = 1" means there are 2 threads, each thread has an io_uring instance, and each thread keep one request inflight, i.e. QD=1? Similar to this:

void thread_fn(ring) {
	while (1) {
		sqe = get_sqe();
		prep_write(sqe);
		io_uring_submit(ring);

		cqe = io_uring_wait();
		handle(cqe);
	}
}

All serializing theads share the same uring queue

See code below: the first thread will create a queue, and all the others will share that queue.

bool FileIOUring::initQueue(uint32_t shared_ring_fd)
{
	if (shared_ring_fd){
		struct io_uring_params p;
		memset(&p, 0, sizeof(p));
		p.flags = IORING_SETUP_ATTACH_WQ;
		p.wq_fd = shared_ring_fd;
		int ret = io_uring_queue_init_params(QD, &ring, &p);
		if (ret < 0) {
			printf("io_uring_queue_init_params: %s\n", strerror(-ret));
			close();
			return false;
		}

	} else {
		int ret = io_uring_queue_init(QD, &ring, 0);
		if(ret < 0)
		{
			printf("io_uring_queue_init: %s\n", strerror(-ret));
			close();
			return false;
		}
	}

	return true;
}

@isilence
Copy link
Collaborator

Do you have a reproducer? It's not clear what the tool is doing, and without understanding that any performance reasoning would be futile.

When I said separate rings, it means there are separate struct io_uring, each separately initialised with io_uring_queue_init*(), regardless whether it's with IORING_SETUP_ATTACH_WQ set or not. Each such io_uring instance will have a separate submission/completion queue pair. IORING_SETUP_ATTACH_WQ does nothing about it and is a separate optimisation.

@boxerab
Copy link
Author

boxerab commented Apr 12, 2024

Do you have a reproducer?

This is a moderately complex C++ project. If you're interested, I can share it with you.

@axboe
Copy link
Owner

axboe commented Apr 12, 2024

Please do share it - even if it's complicated, just being able to run what you are running, tracing will often tell us a lot about how it's done without needing to fully read and comprehend the sample program.

@boxerab
Copy link
Author

boxerab commented Apr 12, 2024

Great, here is the project.

@axboe
Copy link
Owner

axboe commented Apr 12, 2024

And please include also how you are running it. The goal is to make this as trivial as possible for someone to reproduce :-)

@boxerab
Copy link
Author

boxerab commented Apr 12, 2024

I've added an INSTALL file here that details how to run with default settings.

@YoSTEALTH
Copy link
Contributor

It would be nice if simple benchmarks was added to example/benchmark this way other people can test to see how Liburing runs on their systems! It would also help with comparing different languages implementing Liburing as well.

@isilence
Copy link
Collaborator

@YoSTEALTH, not in liburing per se, but for storage there is fio/t/io_uring, someone may even adapt it to liburing and submit a patch

@isilence
Copy link
Collaborator

@boxerab, hopefully we'll find time to look at it, but I suspect the comparison is not apple to apples without even really looking at numbers. When you compare synchronous with asynchronous, it's usually either:

a) There are N threads in both cases, and both run QD1 per thread (i.e. one requests is executing in parallel per thread). In this case the asynchronous inteface basically runs synchronously, which is not good.

b) The asynchronous interface runs just 1 thread but QD=N, i.e. executes all requests in parallel. In this case the async interface may likely lose in throughput and/or latency, but the key is that it takes much less CPU.

And I don't understand which case is yours. There can be more options, a combination of previous two, or for instance N threads generate IO requests, send them to a single IO thread, which executes it via io_uring. But a lot would depend on how it actually implemented.

@boxerab
Copy link
Author

boxerab commented Apr 15, 2024

b) The asynchronous interface runs just 1 thread but QD=N, i.e. executes all requests in parallel. In this case the async interface may likely lose in throughput and/or latency, but the key is that it takes much less CPU.

This is how the benchmark works, but QD=4 for all concurrency levels. Perhaps it should match the concurrency.

I will measure CPU usage for each configuration and see how that looks.

@boxerab
Copy link
Author

boxerab commented Apr 15, 2024

Here are results comparing CPU usage for uring vs synchronous with O_DIRECT. Timing and CPU usage are both identical.

$ /usr/bin/time -v ./iobench  -c 48 -d -s
Run with concurrency = 48, store to disk = true, direct = true, use uring = false
 : 930.559021 ms
	Command being timed: "./iobench -c 48 -d -s"
	User time (seconds): 19.41
	System time (seconds): 0.99
	Percent of CPU this job got: 2104%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.97
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 170112
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 44275
	Voluntary context switches: 3386
	Involuntary context switches: 169
	Swaps: 0
	File system inputs: 8
	File system outputs: 5500888
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

$ /usr/bin/time -v ./iobench  -c 48 -d
Run with concurrency = 48, store to disk = true, direct = true, use uring = true
 : 939.254022 ms
	Command being timed: "./iobench -c 48 -d"
	User time (seconds): 17.72
	System time (seconds): 3.56
	Percent of CPU this job got: 2084%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.02
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 1618560
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 405925
	Voluntary context switches: 22384
	Involuntary context switches: 6744
	Swaps: 0
	File system inputs: 8
	File system outputs: 5500888
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants