Use DirectIO for sampling #101

Open
qizhou opened this issue Nov 24, 2023 · 18 comments

@qizhou
Contributor

qizhou commented Nov 24, 2023

DirectIO bypasses the OS buffer cache, and buffering is unnecessary in our sampling process. Further, it seems that DirectIO is faster on some file systems such as APFS. E.g., with 32 threads sampling on APFS, DirectIO yields 1GB/s randread (actual disk IO is 1GB/s), while buffered IO yields 500MB/s (actual disk IO is 2GB/s; not sure where the overhead comes from).

@qizhou
Contributor Author

qizhou commented Nov 24, 2023

A DirectIO library for Go can be found at https://github.com/ncw/directio

@qzhodl qzhodl added this to the Devnet-2 milestone Nov 25, 2023
@syntrust
Collaborator

syntrust commented Dec 8, 2023

A quick test on macOS (16GB RAM) with an external disk (APFS)
Shard size: 256GB

io type        threads   noncesTried       iostat (MB/s)
non-directio   24        162096 (15.5%)    105
directio       24        78888 (7.5%)      99.79

Trying to do more research and testing to find out why the result is counter-intuitive.

@syntrust
Collaborator

DirectIO bypasses the OS buffer cache, and buffering is unnecessary in our sampling process. Further, it seems that DirectIO is faster on some file systems such as APFS. E.g., with 32 threads sampling on APFS, DirectIO yields 1GB/s randread (actual disk IO is 1GB/s), while buffered IO yields 500MB/s (actual disk IO is 2GB/s; not sure where the overhead comes from).

Hi Qi, what exactly is the fio command used in this test?

@qzhodl
Collaborator

qzhodl commented Dec 11, 2023

A quick test on macOS (16GB RAM) with an external disk (APFS)
Shard size: 256GB

io type        threads   noncesTried       iostat (MB/s)
non-directio   24        162096 (15.5%)    105
directio       24        78888 (7.5%)      99.79

Trying to do more research and testing to find out why the result is counter-intuitive.

It would be great to include the fio results for DirectIO and non-DirectIO (along with the fio commands you used), run with the same config in your test environment.

@qzhodl
Collaborator

qzhodl commented Dec 11, 2023

A quick test on macOS (16GB RAM) with an external disk (APFS)
Shard size: 256GB

io type        threads   noncesTried       iostat (MB/s)
non-directio   24        162096 (15.5%)    105
directio       24        78888 (7.5%)      99.79

Trying to do more research and testing to find out why the result is counter-intuitive.

Also, please clarify: the test result was based on the PR (#121), right?

@qizhou
Contributor Author

qizhou commented Dec 12, 2023

Running on my Mac M2 with APFS + USBC 4.0 + Fanxiang S880

  • write 128GB test file
    fio --name=random-write --ioengine=psync --rw=write --bs=1m --size=4g --numjobs=32 --iodepth=1 --thread --offset_increment=4g --filename fio.dat --direct=1

  • read with 64 threads (direct=1)
    fio --name=random-write --ioengine=psync --rw=randread --bs=4k --size=2g --numjobs=64 --iodepth=1 --thread --offset_increment=2g --filename fio.dat --direct=1

fio reports 1.2GB/s, activity monitor reports 1.2GB/s

  • read with 64 threads (direct=0)
    fio --name=random-write --ioengine=psync --rw=randread --bs=4k --size=2g --numjobs=64 --iodepth=1 --thread --offset_increment=2g --filename fio.dat --direct=0

fio reports 700MB/s, activity monitor reports 2.6GB/s
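
One way to put a number on the gap between the two runs is the ratio of the device-level read rate (Activity Monitor) to the rate fio delivers to the application. The following is a minimal sketch using only the figures quoted above; `amplification` is a hypothetical helper name, not part of fio or es-node.

```go
package main

import "fmt"

// amplification returns how many bytes the device reads for each byte
// delivered to the application. Hypothetical helper for illustration.
func amplification(diskMBps, appMBps float64) float64 {
	return diskMBps / appMBps
}

func main() {
	// Figures from the fio runs above, converted to MB/s.
	fmt.Printf("direct=1: %.2fx\n", amplification(1200, 1200))
	fmt.Printf("direct=0: %.2fx\n", amplification(2600, 700))
}
```

With these numbers the buffered run reads roughly 3.7 bytes from disk for every byte fio reports, while the direct run is close to 1:1. The cause of the extra reads is not established in this thread.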

@qizhou
Contributor Author

qizhou commented Dec 14, 2023

The performance is closely tied to the USB interface. For example, using the above commands with a Samsung 990 Pro, I get

  • USBC 3.2 yields 141 MB/s
  • USBC 3.1 yields 27.2 MB/s

Using another USBC 4.0 + 990 Pro setup, I get

  • fio reports 1050 MB/s (direct=1) and activity monitor reports 900 MB/s
  • fio reports 641 MB/s (direct=0) and activity monitor reports 2.17 GB/s

@syntrust
Collaborator

Tested with exactly the same fio commands on the following hardware:

  • MacBookPro 2.9 GHz 6-Core Intel Core i9

  • fanxiang PS2000 USB 3.2

read with 64 threads (direct=1)

  • fio reports 138MB/s

  • activity monitor reports 141MB/s

read with 64 threads (direct=0)

  • fio reports 140MB/s

  • activity monitor reports 140MB/s

No big difference has been observed. Could it be that the USB 3.2 capacity has already been reached?

@syntrust
Collaborator

Another round of tests with local APPLE SSD(APFS) and a file of size 256GB, with the same commands.

read with 64 threads (direct=1)

  • fio reports 603MB/s

  • activity monitor reports 621MB/s

read with 64 threads (direct=0)

  • fio reports 541MB/s

  • activity monitor reports 578MB/s

Performance is slightly better with direct=1.

@qzhodl qzhodl modified the milestones: Devnet-2, Devnet-3 Dec 15, 2023
@syntrust
Collaborator

A simple sampling test using "github.com/ncw/directio" with local APPLE SSD(APFS) and a file of size 256GB.
Test code can be found in https://github.com/ethstorage/es-node/tree/directio/cmd/directio

dl@MBPDL directio % ./directio -f=256GB   
Start random sampling from file 256GB with 12 threads for 12 seconds long, NOT using DirectIO
File size: 274877906944
Total sampling times: 1011599
dl@MBPDL directio % ./directio -f=256GB -d
Start random sampling from file 256GB with 12 threads for 12 seconds long, using DirectIO
File size: 274877906944
Total sampling times: 1000256

If the test code makes sense, DirectIO does not seem faster than non-directIO.

@qzhodl
Collaborator

qzhodl commented Dec 18, 2023

what is the stat that activity monitor reported, was it the same as the stats above?

@syntrust
Collaborator

what is the stat that activity monitor reported, was it the same as the stats above?

I watched Activity Monitor while repeating the above tests. In both cases (with and without DirectIO), Data read/sec was about 700MB/s, with no noticeable or consistent difference.

@qzhodl
Collaborator

qzhodl commented Dec 26, 2023

Running on my Mac M2 with APFS + USBC 4.0 + Fanxiang S880

  • write 128GB test file
    fio --name=random-write --ioengine=psync --rw=write --bs=1m --size=4g --numjobs=32 --iodepth=1 --thread --offset_increment=4g --filename fio.dat --direct=1
  • read with 64 threads (direct=1)
    fio --name=random-write --ioengine=psync --rw=randread --bs=4k --size=2g --numjobs=64 --iodepth=1 --thread --offset_increment=2g --filename fio.dat --direct=1

fio reports 1.2GB/s, activity monitor reports 1.2GB/s

  • read with 64 threads (direct=0)
    fio --name=random-write --ioengine=psync --rw=randread --bs=4k --size=2g --numjobs=64 --iodepth=1 --thread --offset_increment=2g --filename fio.dat --direct=0

fio reports 700MB/s, activity monitor reports 2.6GB/s

I can reproduce a similar result on the M3 MBP in the office.

direct=1
fio reported: 1.3 GB/s; activity monitor: 1.3 GB/s

direct=0
fio reported: 755 MB/s; activity monitor: 2.9 GB/s

@qzhodl
Collaborator

qzhodl commented Dec 29, 2023

es-node test result on the M3 MBP:

main branch: (direct=0)
mining.threads-per-shard = 128; sampling rate = 47%; activity monitor: 1.1 GB/s

directio branch: (direct=1)
mining.threads-per-shard = 128; sampling rate = 43%; activity monitor: 1.5 GB/s

@syntrust
Collaborator

syntrust commented Dec 29, 2023

The test result of the go sampling test code:
The activity monitor reads no big difference: around 1.02 GB/s

es@ess-MacBook-Pro es-node % go run cmd/directio/sampling.go -f=./es-data/shard-0.dat -t=12
Start random sampling from file ./es-data/shard-0.dat with 12 threads for 12 seconds long, NOT using DirectIO
Total sampling times: 606301
Total sampling times: 607567
Total sampling times: 602730
Total sampling times: 601648 
es@ess-MacBook-Pro es-node % go run cmd/directio/sampling.go -f=./es-data/shard-0.dat -t=12 -d
Start random sampling from file ./es-data/shard-0.dat with 12 threads for 12 seconds long, using directio
Total sampling times: 610625
Total sampling times: 608367
Total sampling times: 608455
Total sampling times: 610870

@qzhodl
Collaborator

qzhodl commented Dec 29, 2023

606301

606301 * 4 * 1024 / 12 ≈ 206 MB/s, which is much lower than the 1.02 GB/s that Activity Monitor reported. We may need to find out where the overhead comes from.
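
The arithmetic above can be checked with a short Go sketch. It assumes each sample reads 4 KiB and uses decimal megabytes (1 MB = 1e6 bytes); `throughputMBps` is a hypothetical helper name used only for this illustration.

```go
package main

import "fmt"

// throughputMBps converts a sample count into application-level
// throughput in MB/s (decimal, 1 MB = 1e6 bytes).
func throughputMBps(samples, sampleBytes uint64, seconds float64) float64 {
	return float64(samples*sampleBytes) / seconds / 1e6
}

func main() {
	// 606301 samples of 4 KiB each over a 12-second run.
	fmt.Printf("%.2f MB/s\n", throughputMBps(606301, 4*1024, 12))
}
```

This prints 206.95 MB/s, so the 1.02 GB/s seen in Activity Monitor is roughly 4.9 times the rate the sampling loop actually consumes.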

@qzhodl
Collaborator

qzhodl commented Jan 5, 2024

Since this is not the main expected user scenario, we can remove it from the Devnet-3 milestone.

@qzhodl qzhodl removed this from the Devnet-3 milestone Jan 5, 2024
@qizhou
Contributor Author

qizhou commented Jan 5, 2024

Yes, I think we can optimize it later.
