Clarify that chunker sizes are in bytes #5923
Conversation
Clarify that the Rabin fingerprint chunker's sizes are specified in bytes, and recommend using larger chunk sizes than shown in the provided examples. People are getting confused over chunker sizes and incorrectly assuming they're in KiB rather than bytes because of the small sizes used in the examples.

License: MIT
Signed-off-by: Daniel Aleksandersen <code@daniel.priv.no>
Thanks!
rabin-[min]-[avg]-[max] (where min/avg/max refer to the desired
chunk sizes in bytes), e.g. 'rabin-262144-524288-1048576'.

The following examples use very small byte sizes to demonstrate the
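Since the whole confusion here is about units, a quick sanity check of what the example's numbers mean in KiB may help (a sketch; the commented-out `ipfs add` invocation and the `somefile` argument are illustrative assumptions):

```shell
# The rabin parameters are plain byte counts; convert the example's values to KiB:
echo "$((262144 / 1024)) KiB min, $((524288 / 1024)) KiB avg, $((1048576 / 1024)) KiB max"

# Hypothetical invocation (rabin-[min]-[avg]-[max], all in bytes):
# ipfs add --chunker=rabin-262144-524288-1048576 somefile
```

This prints `256 KiB min, 512 KiB avg, 1024 KiB max`, i.e. the example already uses reasonable sizes once read as bytes.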
1k isn't that small and users definitely shouldn't use 2048*1024
(2MiB) chunks. Let's just explain the example (e.g., "For example, to chunk a file into 1KiB chunks...").
I’m never suggesting using 2 MiB chunks. rabin-262144-524288-1048576 is in bytes and not kilobytes, so it’s 256 KiB–512 KiB–1 MiB. This is what this patch was intended to help clarify.
1 KiB is ridiculously small in this context. You don’t ever want to go through the DHT and peer discovery for a file that is split into a thousand 1 KiB chunks; the protocol overhead would be enormous. Keep in mind that you’re also storing each individual chunk in a file that takes up a 256 KiB block on most file systems, so the disk packing wastage would also be enormous.
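The overhead argument can be made concrete with a rough back-of-envelope count (a sketch; the 1 GiB file size is an arbitrary assumption, and real chunk counts vary with the rabin parameters):

```shell
# Number of chunks produced for a 1 GiB file at two different chunk sizes
filesize=$((1024 * 1024 * 1024))
echo "1 KiB chunks:   $((filesize / 1024))"
echo "256 KiB chunks: $((filesize / 262144))"
```

At 1 KiB that is over a million chunks to track, announce, and store versus four thousand at 256 KiB.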
You don’t ever want to go through the DHT and peer discovery for a file that is split into a thousand 1 KiB chunks.
You won't have to do that. You'll go through the DHT for the root node (which will likely be smaller than 1KiB as it doesn't contain any actual data).
Keep in mind that you’re also storing each individual chunk in a file that takes up a 256 kiB block on most file systems, so the disk packing wastage would also be enormous.
Most filesystems have 4 KiB blocks. Also note that we're moving towards a datastore that doesn't store each chunk in a separate file. However, I do agree that 1 KiB is small.
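The disk-packing point can be sketched numerically for both claimed block sizes (assumes one chunk per file and whole-block allocation; percentages are integer-truncated):

```shell
# Space wasted storing a single 1 KiB chunk in one filesystem block
chunk=1024
for block in 4096 262144; do
  echo "$block-byte block: $(( (block - chunk) * 100 / block ))% wasted"
done
```

Even with 4 KiB blocks a 1 KiB chunk wastes 75% of its block, so the wastage argument holds either way, just less dramatically.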
I’m never suggesting using 2 MiB chunks. rabin-262144-524288-1048576 is in bytes and not kilobytes, so it’s 256 KiB–512 KiB–1 MiB. This is what this patch was intended to help clarify.
Ah, yeah, you're right. We might want to just bump those sizes up instead.
Otherwise, I'd be fine merging this as-is.
Also note that we're moving towards a datastore that doesn't store each chunk in a separate file.
@Stebalien Where can I find more information about that?
@@ -78,12 +78,16 @@ You can now refer to the added file in a gateway, like so:

 The chunker option, '-s', specifies the chunking strategy that dictates
 how to break files into blocks. Blocks with same content can
-be deduplicated. The default is a fixed block size of
+be deduplicated. Different chunking strategies will produce different
Nit: We might want to move the pros/cons of chunking to a new paragraph (this is pretty choppy). However, your change is still an improvement so we can mess with that later if you'd like.
This sentence is one of the most important things to know about this option and chunking. I moved it up so people would read it early before they get distracted by all the details.
My complaint is that it's "fact, fact, fact..." with no relationship between the facts. But it didn't flow all that well before either, so we can fix this later.