Why txhashset download response is so resource heavy #2787

Closed
garyyu opened this issue Apr 30, 2019 · 6 comments

Comments

@garyyu
Contributor

garyyu commented Apr 30, 2019

Looking at this picture, the CPU load on the server side during the txhashset download is quite high. We need to check whether this is reasonable and whether there is any room for optimization.

(Screenshot: server CPU load during txhashset download, 2019-04-30)

@garyyu
Contributor Author

garyyu commented May 22, 2019

This could be optimized by #2813, which reduces how often the txhashset packaging operation has to run.

@bladedoyle
Contributor

@garyyu In addition to CPU and network usage, there are also disk space issues.

I routinely see my grin server generate many gigabytes of txhashset zip files per hour. Those files are left behind and fill up the disk. I have resorted to adding a cron job to clean them up, e.g.:

*/5 * * * * find /data/grin/.grin -maxdepth 1 -name txhashset_\* -cmin +15 -exec rm -rf {} \;

@hashmap
Contributor

hashmap commented May 29, 2019

@bladedoyle this is fixed in 1.1

@garyyu
Contributor Author

garyyu commented Jun 7, 2019

Now that #2813 has been merged, I can investigate this issue again.

And here is a very interesting new graph of CPU usage vs. network traffic:

(Graph: CPU usage vs. network traffic)

This time there is no CPU load from creating and packaging the txhashset (thanks to @cadmuspeverell's 12-hour txhashset_archive_interval). Please pay attention to the red circle I drew there; we can see an obvious pattern:

The lower the network traffic (i.e. the slower the remote node downloads the txhashset zip file), the higher the CPU load. That is quite strange and doesn't make sense at first glance.

Considering the recent findings about the Rust thread::sleep performance problem (for example #2773 and #2855), here is what I found:

In util/src/read_write.rs:

use std::io::{self, Write};
use std::thread;
use std::time::Duration;

/// Write the whole buffer to a (non-blocking) stream, retrying on
/// WouldBlock and sleeping between attempts until `timeout` is reached.
pub fn write_all(stream: &mut dyn Write, mut buf: &[u8], timeout: Duration) -> io::Result<()> {
	let sleep_time = Duration::from_micros(10);
	let mut count = Duration::new(0, 0);

	while !buf.is_empty() {
		match stream.write(buf) {
			Ok(0) => {
				return Err(io::Error::new(
					io::ErrorKind::WriteZero,
					"failed to write whole buffer",
				));
			}
			Ok(n) => buf = &buf[n..],
			Err(ref e) if e.kind() == io::ErrorKind::Interrupted => {}
			Err(ref e) if e.kind() == io::ErrorKind::WouldBlock => {}
			Err(e) => return Err(e),
		}
		if !buf.is_empty() {
			thread::sleep(sleep_time); // <<<< problem is here! 10us sleep
			count += sleep_time;
		} else {
			break;
		}
		if count > timeout {
			return Err(io::Error::new(io::ErrorKind::TimedOut, "writing to stream"));
		}
	}
	Ok(())
}
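To get a rough sense of why a slower downloader means a higher CPU load here (my own back-of-the-envelope illustration, not a measurement taken from the graph): while the remote peer's receive buffer is full, write() keeps returning WouldBlock, so the loop wakes up once per sleep_time and retries.

fn main() {
	// Approximate wake-ups per second while the socket would block,
	// for a few candidate sleep intervals (illustrative values only).
	for micros in [10u64, 1_000, 5_000] {
		println!("sleep {:>5} us -> ~{} wake-ups/s", micros, 1_000_000 / micros);
	}
}

At the current 10us sleep that is on the order of 100,000 failed write attempts and thread wake-ups per second for each slow connection, which fits the pattern circled above.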

So an easy "fix" is to change this 10us sleep to a 1ms sleep, since the resulting 8000 bytes / 1ms = 8 MB/s speed limit should cover 99% of cases.

Perhaps the perfect fix is to switch from a non-blocking stream to a blocking stream as in #2855, but that needs more discussion and testing. Until #2855 reaches a conclusion, I will write a two-line PR for this improvement.
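A sketch of what that small PR would change, assuming it only adjusts the sleep interval inside write_all (the 1ms value is the one proposed above; the actual PR may pick a different figure):

	// was: let sleep_time = Duration::from_micros(10);
	let sleep_time = Duration::from_millis(1); // caps throughput at ~8000 bytes / 1ms = 8 MB/s per connection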

@garyyu
Contributor Author

garyyu commented Jun 8, 2019

After a fix (changing the 10us sleep above to a 5ms sleep), the test result is quite good: there is no longer a huge CPU load increase when responding to a txhashset request:

(Screenshot: CPU load and network traffic after the fix, 2019-06-08)

Note: each sharp increase in network traffic corresponds to a txhashset request from a remote node.

Another benefit of this 5ms sleep is that it limits each txhashset request to a maximum of 8000 bytes / 5ms = 1.6 MB/s download speed, avoiding excessive load on the Grin server's network bandwidth. If we want to raise this bandwidth limit in the future, we can increase the 8000-byte chunk size in the caller of write_all.
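A quick sanity check of that cap (illustrative only; the 8000-byte chunk size is the value mentioned above for the caller of write_all, not a constant defined inside this function):

use std::time::Duration;

fn main() {
	let chunk_bytes = 8_000u64;            // bytes handed to write_all per chunk by the caller
	let sleep = Duration::from_millis(5);  // sleep between partial writes
	let max_rate = chunk_bytes as f64 / sleep.as_secs_f64();
	println!("max throughput ~ {:.2} MB/s", max_rate / 1_000_000.0); // prints ~1.60 MB/s
}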

@garyyu
Contributor Author

garyyu commented Aug 27, 2019

Closed by #2855

garyyu closed this as completed Aug 27, 2019