Clarify that chunker sizes are in bytes #5923
Conversation
Clarify that the Rabin fingerprint chunker's sizes are specified in bytes, and recommend using larger chunk sizes than shown in the provided examples. People are getting confused over chunker sizes and incorrectly assuming they're in KiB rather than bytes because of the small sizes used in the examples.

License: MIT
Signed-off-by: Daniel Aleksandersen <code@daniel.priv.no>
Thanks!
rabin-[min]-[avg]-[max] (where min/avg/max refer to the desired
chunk sizes in bytes), e.g. 'rabin-262144-524288-1048576'.

The following examples use very small byte sizes to demonstrate the
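Since the whole confusion here is about units, a quick sanity check of what the example's numbers mean in KiB may help (a sketch; the commented-out `ipfs add` invocation and the `somefile` argument are illustrative assumptions):

```shell
# The rabin parameters are plain byte counts; convert the example's values to KiB:
echo "$((262144 / 1024)) KiB min, $((524288 / 1024)) KiB avg, $((1048576 / 1024)) KiB max"

# Hypothetical invocation (rabin-[min]-[avg]-[max], all in bytes):
# ipfs add --chunker=rabin-262144-524288-1048576 somefile
```

This prints `256 KiB min, 512 KiB avg, 1024 KiB max`, i.e. the example already uses reasonable sizes once read as bytes.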
1k isn't that small and users definitely shouldn't use 2048*1024
(2MiB) chunks. Let's just explain the example (e.g., "For example, to chunk a file into 1KiB chunks...").
I’m never suggesting using 2 MiB chunks. rabin-262144-524288-1048576 is in bytes and not kilobytes, so it’s 256 KiB–512 KiB–1 MiB. This is what this patch was intended to help clarify.
1 KiB is ridiculously small in this context. You don’t ever want to go through the DHT and peer discovery for a file that is split into a thousand 1 KiB chunks; the protocol overhead would be enormous. Keep in mind that you’re also storing each individual chunk in a file that takes up a 256 KiB block on most file systems, so the disk packing wastage would also be enormous.
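The overhead argument can be made concrete with a rough back-of-envelope count (a sketch; the 1 GiB file size is an arbitrary assumption, and real chunk counts vary with the rabin parameters):

```shell
# Number of chunks produced for a 1 GiB file at two different chunk sizes
filesize=$((1024 * 1024 * 1024))
echo "1 KiB chunks:   $((filesize / 1024))"
echo "256 KiB chunks: $((filesize / 262144))"
```

At 1 KiB that is over a million chunks to track, announce, and store versus four thousand at 256 KiB.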
You don’t ever want to go through the DHT and peer discovery for a file that is split into a thousand 1 KiB chunks.
You won't have to do that. You'll go through the DHT for the root node (which will likely be smaller than 1KiB as it doesn't contain any actual data).
Keep in mind that you’re also storing each individual chunk in a file that takes up a 256 kiB block on most file systems, so the disk packing wastage would also be enormous.
Most filesystems have 4 KiB blocks. Also note that we're moving towards a datastore that doesn't store each chunk in a separate file. However, I do agree that 1 KiB is small.
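The disk-packing point can be sketched numerically for both claimed block sizes (assumes one chunk per file and whole-block allocation; percentages are integer-truncated):

```shell
# Space wasted storing a single 1 KiB chunk in one filesystem block
chunk=1024
for block in 4096 262144; do
  echo "$block-byte block: $(( (block - chunk) * 100 / block ))% wasted"
done
```

Even with 4 KiB blocks a 1 KiB chunk wastes 75% of its block, so the wastage argument holds either way, just less dramatically.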
I’m never suggesting using 2 MiB chunks. rabin-262144-524288-1048576 is in bytes and not kilobytes, so it’s 256 KiB–512 KiB–1 MiB. This is what this patch was intended to help clarify.
Ah, yeah, you're right. We might want to just bump those sizes up instead.
Otherwise, I'd be fine merging this as-is.
Also note that we're moving towards a datastore that doesn't store each chunk in a separate file.
@Stebalien Where can I find more information about that?
@@ -78,12 +78,16 @@ You can now refer to the added file in a gateway, like so:

 The chunker option, '-s', specifies the chunking strategy that dictates
 how to break files into blocks. Blocks with same content can
-be deduplicated. The default is a fixed block size of
+be deduplicated. Different chunking strategies will produce different
Nit: We might want to move the pros/cons of chunking to a new paragraph (this is pretty choppy). However, your change is still an improvement so we can mess with that later if you'd like.
This sentence is one of the most important things to know about this option and chunking. I moved it up so people would read it early before they get distracted by all the details.
My complaint is that it's "fact, fact, fact..." with no relationship between the facts. But it didn't flow all that well before either, so we can fix this later.