
multithreaded buse #11

Open
wants to merge 2 commits into base: master

Conversation

divinity76

It's messy, but it seems to work fine with busexmp.c (not tested extensively, but no problems encountered on ext2 and btrfs).

  • It has no safeguards against creating an excessive number of threads. I'm pretty sure it has over 10 threads running (for a split second) during mkfs.btrfs, but again, no problems encountered: fsck can't detect any problems on ext2, and "btrfs check --repair" can't detect any errors on btrfs.

  • Reading write requests from NBD is still single-threaded (but processing/writing them is multithreaded). I'm not sure how to fix that; I guess I'll need several sockets to NBD.

  • Writing responses to NBD is mutex-locked (effectively single-threaded). Again, not sure how to fix that; I guess I'll need several sockets to NBD.

I totally understand if you don't want to merge this, but I'd like an opinion on it either way ^^

@bandi13

bandi13 commented Nov 4, 2016

I have concerns about your implementation. First, it doesn't ensure that the replies are sent in the order the requests come in; you'll need to store the threads in a queue and handle the responses in order. Second, are you sure that there are actually multiple requests coming in to NBD at once? What kinds of performance improvements did you see with multi- vs. single-threading?

@divinity76
Author

divinity76 commented Nov 4, 2016

thank you for taking a look! :)

> First it doesn't ensure that the replies are sent in the order the requests come in.

No, and I don't need to: the reply.handle is used to identify which request I'm responding to.

> You'll need to store the threads in a queue and handle the responses in order.

No, I don't believe that is the case; see above.

> Second are you sure that there are actually multiple requests coming in to NBD at once

Yes, I am. NBD does not wait for one request to finish before issuing more requests. For instance, mkfs.btrfs will create a lot of simultaneous requests.

> What kinds of performance improvements did you see with multi- vs single-threading?

I need more time to test that, but the theory is that it can be much faster at handling multiple slow read/write requests at once.

@bandi13

bandi13 commented Nov 4, 2016

Cool! You're right. Thanks for explaining. I'll have to play with that too.

@divinity76
Author

divinity76 commented Nov 4, 2016

You asked about performance. Note that I believe several things in the multithreaded code can still be improved. For instance, starting a new thread for every little request is probably crazy; a thread pool would probably be faster. Actually creating a thread isn't free, so threads should probably be reused.

I don't have anything interesting to test with at the moment, and I believe busexmp.c won't benefit much (if at all) from multithreading. Still:

(warning: sorry, I do not have access to a completely quiet system to test on, so there will be some noise. Feel free to do your own tests, of course.)

Creating a btrfs filesystem 100 times, with the single-threaded buse.c running at /dev/nbd1 and the multithreaded busemt.c running at /dev/nbd0:

root@Deb9DEtestX:/home/hanshenrik/BUSE# ./bench.php >/dev/null
single_timeused: 1.6004650592804
multi_timeused: 1.7799780368805
winner: single won!
margin: 0.1795129776001
root@Deb9DEtestX:/home/hanshenrik/BUSE# ./bench.php >/dev/null
single_timeused: 1.5808670520782
multi_timeused: 1.6301081180573
winner: single won!
margin: 0.049241065979004
root@Deb9DEtestX:/home/hanshenrik/BUSE# ./bench.php >/dev/null
single_timeused: 1.5171689987183
multi_timeused: 1.6563220024109
winner: single won!
margin: 0.13915300369263
root@Deb9DEtestX:/home/hanshenrik/BUSE# ./bench.php >/dev/null
single_timeused: 1.52512383461
multi_timeused: 1.5709359645844
winner: single won!
margin: 0.045812129974365
root@Deb9DEtestX:/home/hanshenrik/BUSE# ./bench.php >/dev/null
single_timeused: 1.538360118866
multi_timeused: 1.5972349643707
winner: single won!
margin: 0.058874845504761
root@Deb9DEtestX:/home/hanshenrik/BUSE# ./bench.php >/dev/null
single_timeused: 1.5681219100952
multi_timeused: 1.7586491107941
winner: single won!
margin: 0.19052720069885

bench.php:

#!/usr/bin/php
<?php
$tests = 100;

$starttime = microtime(true);
for ($i = 0; $i < $tests; ++$i) {
    system("mkfs.btrfs /dev/nbd0 -f");
}
$endtime = microtime(true);
$multi_timeused = $endtime - $starttime;

$starttime = microtime(true);
for ($i = 0; $i < $tests; ++$i) {
    system("mkfs.btrfs /dev/nbd1 -f");
}
$endtime = microtime(true);
$single_timeused = $endtime - $starttime;

fwrite(STDERR, "single_timeused: " . $single_timeused . PHP_EOL);
fwrite(STDERR, "multi_timeused: " . $multi_timeused . PHP_EOL);
fwrite(STDERR, "winner: ");
if ($single_timeused === $multi_timeused) {
    fwrite(STDERR, "It's a draw!" . PHP_EOL);
} elseif ($single_timeused < $multi_timeused) {
    fwrite(STDERR, "single won!" . PHP_EOL);
} else {
    fwrite(STDERR, "multi won!" . PHP_EOL);
}
fwrite(STDERR, "margin: " . abs($multi_timeused - $single_timeused) . PHP_EOL);

(and if you wanna complain about how shitty PHP is, please do it elsewhere, like my email or /r/lolphp )

hdparm -Tt:

root@Deb9DEtestX:/mt# hdparm -Tt /dev/nbd0

/dev/nbd0:
 Timing cached reads:   17142 MB in  2.00 seconds = 8577.03 MB/sec
 Timing buffered disk reads: 128 MB in  0.16 seconds = 818.95 MB/sec
root@Deb9DEtestX:/mt# hdparm -Tt /dev/nbd0

/dev/nbd0:
 Timing cached reads:   17624 MB in  2.00 seconds = 8818.65 MB/sec
 Timing buffered disk reads: 128 MB in  0.17 seconds = 766.66 MB/sec
root@Deb9DEtestX:/mt# hdparm -Tt /dev/nbd0

/dev/nbd0:
 Timing cached reads:   17380 MB in  2.00 seconds = 8696.67 MB/sec
 Timing buffered disk reads: 128 MB in  0.17 seconds = 775.21 MB/sec
root@Deb9DEtestX:/mt# 
root@Deb9DEtestX:/mt# hdparm -Tt /dev/nbd1

/dev/nbd1:
 Timing cached reads:   16590 MB in  2.00 seconds = 8301.21 MB/sec
 Timing buffered disk reads: 128 MB in  0.12 seconds = 1071.70 MB/sec
root@Deb9DEtestX:/mt# hdparm -Tt /dev/nbd1

/dev/nbd1:
 Timing cached reads:   16882 MB in  2.00 seconds = 8446.80 MB/sec
 Timing buffered disk reads: 128 MB in  0.10 seconds = 1247.83 MB/sec
root@Deb9DEtestX:/mt# hdparm -Tt /dev/nbd1

/dev/nbd1:
 Timing cached reads:   17108 MB in  2.00 seconds = 8560.67 MB/sec
 Timing buffered disk reads: 128 MB in  0.11 seconds = 1168.79 MB/sec

dd WRITE test:

root@Deb9DEtestX:/mt# dd if=/dev/zero of=/dev/nbd0 bs=1M
dd: error writing '/dev/nbd0': No space left on device
129+0 records in
128+0 records out
134217728 bytes (134 MB, 128 MiB) copied, 0.08685 s, 1.5 GB/s
root@Deb9DEtestX:/mt# dd if=/dev/zero of=/dev/nbd0 bs=1M
dd: error writing '/dev/nbd0': No space left on device
129+0 records in
128+0 records out
134217728 bytes (134 MB, 128 MiB) copied, 0.0737198 s, 1.8 GB/s
root@Deb9DEtestX:/mt# dd if=/dev/zero of=/dev/nbd0 bs=1M
dd: error writing '/dev/nbd0': No space left on device
129+0 records in
128+0 records out
134217728 bytes (134 MB, 128 MiB) copied, 0.0751926 s, 1.8 GB/s
root@Deb9DEtestX:/mt# dd if=/dev/zero of=/dev/nbd0 bs=1M
dd: error writing '/dev/nbd0': No space left on device
129+0 records in
128+0 records out
134217728 bytes (134 MB, 128 MiB) copied, 0.0694624 s, 1.9 GB/s
root@Deb9DEtestX:/mt# dd if=/dev/zero of=/dev/nbd1 bs=1M
dd: error writing '/dev/nbd1': No space left on device
129+0 records in
128+0 records out
134217728 bytes (134 MB, 128 MiB) copied, 0.0897694 s, 1.5 GB/s
root@Deb9DEtestX:/mt# dd if=/dev/zero of=/dev/nbd1 bs=1M
dd: error writing '/dev/nbd1': No space left on device
129+0 records in
128+0 records out
134217728 bytes (134 MB, 128 MiB) copied, 0.0733665 s, 1.8 GB/s
root@Deb9DEtestX:/mt# dd if=/dev/zero of=/dev/nbd1 bs=1M
dd: error writing '/dev/nbd1': No space left on device
129+0 records in
128+0 records out
134217728 bytes (134 MB, 128 MiB) copied, 0.0672646 s, 2.0 GB/s
root@Deb9DEtestX:/mt# dd if=/dev/zero of=/dev/nbd1 bs=1M
dd: error writing '/dev/nbd1': No space left on device
129+0 records in
128+0 records out
134217728 bytes (134 MB, 128 MiB) copied, 0.0733574 s, 1.8 GB/s

dd READ test:

root@Deb9DEtestX:/mt# dd if=/dev/nbd0 of=/dev/null bs=1M
128+0 records in
128+0 records out
134217728 bytes (134 MB, 128 MiB) copied, 0.0422518 s, 3.2 GB/s
root@Deb9DEtestX:/mt# dd if=/dev/nbd0 of=/dev/null bs=1M
128+0 records in
128+0 records out
134217728 bytes (134 MB, 128 MiB) copied, 0.0493657 s, 2.7 GB/s
root@Deb9DEtestX:/mt# dd if=/dev/nbd0 of=/dev/null bs=1M
128+0 records in
128+0 records out
134217728 bytes (134 MB, 128 MiB) copied, 0.0415744 s, 3.2 GB/s
root@Deb9DEtestX:/mt# dd if=/dev/nbd0 of=/dev/null bs=1M
128+0 records in
128+0 records out
134217728 bytes (134 MB, 128 MiB) copied, 0.0433726 s, 3.1 GB/s
root@Deb9DEtestX:/mt# dd if=/dev/nbd1 of=/dev/null bs=1M
128+0 records in
128+0 records out
134217728 bytes (134 MB, 128 MiB) copied, 0.041047 s, 3.3 GB/s
root@Deb9DEtestX:/mt# dd if=/dev/nbd1 of=/dev/null bs=1M
128+0 records in
128+0 records out
134217728 bytes (134 MB, 128 MiB) copied, 0.0481382 s, 2.8 GB/s
root@Deb9DEtestX:/mt# dd if=/dev/nbd1 of=/dev/null bs=1M
128+0 records in
128+0 records out
134217728 bytes (134 MB, 128 MiB) copied, 0.0356341 s, 3.8 GB/s
root@Deb9DEtestX:/mt# dd if=/dev/nbd1 of=/dev/null bs=1M
128+0 records in
128+0 records out
134217728 bytes (134 MB, 128 MiB) copied, 0.0448003 s, 3.0 GB/s

I guess I should test actual mounted-filesystem performance too. Suggestions?

@divinity76
Author

Hmm, it just occurred to me that the kernel's I/O caches may have skewed my tests big time; I'm not sure. Maybe try dd oflag=sync?

@bandi13

bandi13 commented Nov 5, 2016

Nice work on the testing. Yes, you're right about the kernel's I/O caches; they can get pretty big. I've been working on some test programs to validate random accesses as well as filesystem-level tests. Take a look and see if they're of use. They can also help validate your system, to make sure what you write in is what is read back out.

(For the record, the PHP thing didn't even cross my mind. Whatever gets the job done. You can always change it later if it's a problem.)

@bandi13

bandi13 commented Nov 7, 2016

I was thinking about this: what happens if there's a read on a sector followed by a write? Two threads are started, and the write thread is executed first, then the read. You may lose data if it doesn't handle the requests in sequence, no?

@divinity76
Author

divinity76 commented Nov 7, 2016

Yeah, that would be bad. However, I believe the kernel won't do that..?

The data would already be in memory, so the kernel could probably get those bytes from its own I/O caches, which would be much faster. And you know how the kernel devs love to micro-optimize the shit out of everything? (Except, ahem, /proc.)

I asked this question on a Linux support channel (##linux @ freenode); here's what I got (noise removed):

<hanshenrik> may the kernel send a write request to a block device, then send a read request to the same block device BEFORE the write request has finished?  
<hanshenrik> and if so, does the kernel expect the new data (not yet written), or the old data? 
<hanshenrik> or a horribly broken mix of the 2?
<hanshenrik> err, i mean, a read request to the same sectors*
<[R]> hanshenrik: the kenrel has cache
<hanshenrik> [R], im making a block device, it will have different speeds for reads and writes, namely, reads will be much faster. i guess i shouldn't worry about the situation of handling a read request to sectors that are currently being written by another request? 
<hanshenrik> (i won't crash or anything, but the data returned from such a request would be a random-ish combination of both)
<[R]> hanshenrik: the kernel handles all of that
<hanshenrik> thanks [R]

Seems promising :) I should probably ask on the Linux Kernel Mailing List too. (PS: I trust [R]; he's an ##linux old-timer who has proven himself knowledgeable plenty of times over the years.)

@divinity76
Author

divinity76 commented Nov 7, 2016

I'm just GUESSING, and testing needs to be done to be sure.

proc1 wants to read sectors 1-10
the kernel sends a read request to the BD
(request not yet finished) proc2 wants to read sectors 1-10 (or 5-10)
the kernel notices that a request to read those sectors has already been scheduled, and does not send a new request
proc3 wants to read sectors 7-15
the kernel notices that a request to read sectors 7-10 has already started, and will issue a request to read sectors 11-15...

Now, what if, all this time, there was a process 0 already writing to sectors 1-8? I believe the kernel would have just issued a request to read sectors 9-10 instead of 1-10 for proc1-2, and 11-15 for proc3.

But what if there's then a proc4 wanting to write those sectors before the read requests have finished?

Hmmmm, I don't know. Testing should definitely be done.

@divinity76
Author

divinity76 commented Nov 7, 2016

Is it possible the kernel would just have lied to the other programs and given them what proc4 wanted to write, rather than what was actually on the BD? Or would the kernel stall proc4's write? I don't know; my guess is a stall of proc4's write. Should test.

@divinity76
Author

divinity76 commented Nov 7, 2016

(But if the kernel just issues the write request for proc4 instantly, and doesn't want to lie to the other processes about what was actually on the BD at the time of the read request versus what is scheduled to be written, then you're right, we have a problem. Should test to be sure.)

@nixomose

All reads and writes (except for direct I/O) go through the page cache.
So the data from the write request will be in the cache when the read requests come in, and they will be satisfied from the cache. The kernel will send the writes to the block device, and when a read request comes in for blocks it doesn't have in the cache, it will ask the block device for them. There's no overlap; if there were, it would be in the page cache.

@fruffy

fruffy commented Feb 4, 2018

This looks like an interesting contribution, is it still in consideration?
