Add snippet for impacts on parquet file of data type and compression algorithm #26
base: main
…ata, depending on chosen compression algorithm and chosen data type.
// COMMAND ---------- | ||
|
||
/* | ||
This snippet shows who data type for numerical information and compression can affect Spark. |
Typo: "who" should be "how".
|
||
# Symptom | ||
Storage needs does not match with expectations, for example is higher in output after filtering than in input. |
Typo: "needs does not match" should be "needs do not match".
I'd suggest: "for example, volume is higher".
|
||
# Explanation | ||
There is difference in in type when reading the data and type when writing it, causing a loss of compression perfomance. |
Typo: "in" is doubled.
|
||
// COMMAND ---------- | ||
|
||
// We are going to demonstrate our purpose by converting the same numerica data into different types, |
Typo: "numerica" should be "numerical".
// and write it in parquet using different compression | ||
|
||
|
||
// Here are the type we want to compare |
Typo: "type" should be "types".
.option("compression", parquetCompressionName) | ||
.format("parquet") | ||
.mode("overwrite") | ||
.save(fileName) |
I'd reuse the mechanism for creating random, disposable temporary directories that is already present in the other notebooks. That way, temporary files land in the predefined trash directory that is already set up. See the other snippets for reference: they use a UUID, and you can reuse the same imports and code.
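For illustration only, here is a minimal sketch of what such a UUID-based helper could look like. The object name `TempDirs`, the method `freshDir`, and the `snippets-trash` root are assumptions made up for this example; the helper actually used in the other notebooks may differ, so prefer copying from them.

```scala
import java.nio.file.{Files, Path, Paths}
import java.util.UUID

object TempDirs {
  // Hypothetical root for disposable output; the repository's real trash directory may differ.
  val trashRoot: Path =
    Paths.get(System.getProperty("java.io.tmpdir"), "snippets-trash")

  /** Creates and returns a fresh, uniquely named directory under `root`. */
  def freshDir(root: Path = trashRoot): Path =
    Files.createDirectories(root.resolve(UUID.randomUUID().toString))
}
```

The path passed to `.save(...)` could then be derived from it, for example `TempDirs.freshDir().resolve("data").toString`, so each run writes to its own location.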
.mapValues(_.map(_._1).sum).toSeq.sortBy(_._2) | ||
|
||
println("part* files sizes (in kB):") | ||
sizeOnDisk.foreach( o=>println ( s"${o._1}\t${o._2}\tkB")) |
|
||
// COMMAND ---------- | ||
|
||
// now we can also add a check on the effect of choosing a specific number type on the obtained values |
I'd rephrase this as follows:
// Now let's see how long it takes to read and process such data when using different compression and numerical data formats.
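If it helps, a small timing wrapper along these lines could be used for that measurement. This is only a sketch: `Timing` and `timed` are hypothetical names, not something already in the repository, and a single wall-clock run is a rough indication rather than a benchmark.

```scala
object Timing {
  /** Runs `block` once and returns its result with the elapsed wall-clock time in milliseconds. */
  def timed[A](block: => A): (A, Double) = {
    val start = System.nanoTime()
    val result = block // evaluated exactly once, here
    val elapsedMs = (System.nanoTime() - start) / 1e6
    (result, elapsedMs)
  }
}
```

Since Spark reads are lazy, the timed block needs to end with an action, for example `Timing.timed(spark.read.parquet(fileName).count())`, otherwise only the planning time is measured.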
|
||
// COMMAND ---------- | ||
|
||
// Now you should also check how the variety of values affects compression : |
I'd encourage you to add some conclusions here so that people reading the snippet can take action and understand what to expect with their change.
- better visualization of results with pivots - more comments
|
||
# Explanation | ||
There is difference in type when reading the data and type when writing it, causing a loss of compression perfomance. | ||
|
Could you please add the "# What to aim for concretely" section, as in the other snippets? E.g. choose the right data type according to your needs, taking performance into account. You could reference the example you gave below.
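To make the point of that section tangible outside Spark, a tiny stand-alone sketch like the following could be referenced. It is not Parquet: Parquet applies its own encodings before compression, so the numbers will differ from what the notebook measures. It only shows that the same values occupy different raw and gzip-compressed sizes depending on the chosen type. `EncodingSizes` and its methods are names invented for this example.

```scala
import java.io.ByteArrayOutputStream
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets
import java.util.zip.GZIPOutputStream

object EncodingSizes {
  /** Gzip-compresses `bytes` and returns the compressed length in bytes. */
  def gzipSize(bytes: Array[Byte]): Int = {
    val buffer = new ByteArrayOutputStream()
    val gzip = new GZIPOutputStream(buffer)
    gzip.write(bytes)
    gzip.close() // finishes the stream and writes the gzip trailer
    buffer.size()
  }

  /** The same values stored as 4-byte integers. */
  def asInts(values: Seq[Int]): Array[Byte] = {
    val bb = ByteBuffer.allocate(values.length * 4)
    values.foreach(v => bb.putInt(v))
    bb.array()
  }

  /** The same values stored as 8-byte longs. */
  def asLongs(values: Seq[Int]): Array[Byte] = {
    val bb = ByteBuffer.allocate(values.length * 8)
    values.foreach(v => bb.putLong(v.toLong))
    bb.array()
  }

  /** The same values stored as 8-byte doubles. */
  def asDoubles(values: Seq[Int]): Array[Byte] = {
    val bb = ByteBuffer.allocate(values.length * 8)
    values.foreach(v => bb.putDouble(v.toDouble))
    bb.array()
  }

  /** The same values stored as newline-separated decimal text. */
  def asStrings(values: Seq[Int]): Array[Byte] =
    values.mkString("\n").getBytes(StandardCharsets.UTF_8)
}
```

Printing `bytes.length` and `EncodingSizes.gzipSize(bytes)` for each encoding of the same sequence gives a quick feel for how much the type alone changes the volume, before any Spark or Parquet specifics come into play.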
Thanks for the changes!
Not 100% sure this leads us to the point we want to demonstrate.
Could you have a look at it?