Add snippet for impacts on parquet file of data type and compression algorithm #26

Open
wants to merge 4 commits into
base: main
from

Conversation

ThAccart
Contributor

@ThAccart ThAccart commented Nov 8, 2024

Not 100% sure this leads us to the point we want to demonstrate.
Could you have a look at it?

…ata, depending on chosen compression algorithm and chosen data type.
// COMMAND ----------

/*
This snippet shows who data type for numerical information and compression can affect Spark.

Typo: how.

# Symptom
Storage needs does not match with expectations, for example is higher in output after filtering than in input.

Typo: needs do not match.

I'd suggest: "for example, volume is higher".

# Explanation
There is difference in in type when reading the data and type when writing it, causing a loss of compression perfomance.

Typo, double in.


// COMMAND ----------

// We are going to demonstrate our purpose by converting the same numerica data into different types,

Typo, numerical.

// and write it in parquet using different compression


// Here are the type we want to compare

Typo: types.

.option("compression", parquetCompressionName)
.format("parquet")
.mode("overwrite")
.save(fileName)

I'd reuse the mechanism for creating random, disposable temporary directories that is already present in the other notebooks. That way temporary files land in the predefined, already set up trash directory. See the other snippets for reference (they use a UUID; you can reuse the same imports and snippet).
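
To illustrate the idea only, here is a minimal sketch of a UUID-based disposable directory. It is not the helper from the other notebooks: `trashRoot`, `newTmpDir`, and the `"parquet-types"` and `"int_snappy"` labels are hypothetical names chosen for this example, and the system temp directory stands in for the project's real trash directory.

```scala
import java.nio.file.{Files, Path, Paths}
import java.util.UUID

// Hypothetical trash root: replace with the directory the other notebooks already use.
val trashRoot: Path = Paths.get(System.getProperty("java.io.tmpdir"), "snippets-trash")

// Create a unique, disposable directory for one run and return its path.
def newTmpDir(prefix: String): Path = {
  val dir = trashRoot.resolve(s"$prefix-${UUID.randomUUID()}")
  Files.createDirectories(dir)
  dir
}

val outputDir: Path = newTmpDir("parquet-types")

// Each (type, compression) combination can then get its own subfolder.
val fileName: String = outputDir.resolve("int_snappy").toString
println(fileName)
```

Because every run gets a fresh directory, `.mode("overwrite")` never touches the output of a previous run, and cleaning up means deleting one root folder.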

.mapValues(_.map(_._1).sum).toSeq.sortBy(_._2)

println("part* files sizes (in kB):")
sizeOnDisk.foreach( o=>println ( s"${o._1}\t${o._2}\tkB"))

Using implicits, you can use toDF to display a more friendly table:
image
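
A rough sketch of the `toDF` suggestion, assuming the sizes are held as a sequence of (label, kB) pairs as the `println` loop implies. `sampleSizes`, `sizesDF`, and the column names are hypothetical, and the two rows are placeholders, not measured results.

```scala
import org.apache.spark.sql.SparkSession

// In a notebook `spark` already exists; getOrCreate() then simply returns that session.
val spark = SparkSession.builder().master("local[*]").appName("sizes-table").getOrCreate()
import spark.implicits._

// Placeholder rows only, NOT measured results: (label, size in kB).
val sampleSizes: Seq[(String, Long)] = Seq(("example_a", 1L), ("example_b", 2L))

// toDF comes from spark.implicits._ and names the tuple fields as columns.
val sizesDF = sampleSizes.toDF("file", "size_kB")
sizesDF.orderBy("size_kB").show(truncate = false)
```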


// COMMAND ----------

// now we can also add a check on the effect of choosing a specific number type on the obtained values

I'd rephrase this as follows:

// Now let's see how long it takes to read and process such data when using different compression and numerical data formats.
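
For the timing itself, a small helper along these lines could work. This is a sketch with a hypothetical name (`timed`); the summed range is only a stand-in workload so the example runs without Spark. In the notebook the block would wrap the actual Parquet read and an action on it.

```scala
// Runs `block` once and returns its result with the elapsed wall-clock time in milliseconds.
def timed[A](block: => A): (A, Long) = {
  val start = System.nanoTime()
  val result = block
  val elapsedMs = (System.nanoTime() - start) / 1000000L
  (result, elapsedMs)
}

// Stand-in workload (an assumption, not the PR's code); replace with the Parquet read and an action.
val (total, ms) = timed { (1L to 1000000L).sum }
println(s"sum=$total took $ms ms")
```

A single run is noisy (JVM warm-up, caching), so repeating each measurement a few times and comparing medians gives a fairer picture.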


// COMMAND ----------

// Now you should also check how the variety of values affects compression :

I'd encourage you to add some conclusions here so that people reading the snippet can take action and understand what to expect with their change.
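
One way to back such a conclusion is a tiny experiment on value variety. The sketch below deliberately uses plain gzip over raw bytes as a stand-in; it is not Parquet, which additionally applies its own encodings such as dictionary and run-length encoding. `gzipSize`, `fewDistinct`, and `manyDistinct` are hypothetical names.

```scala
import java.io.ByteArrayOutputStream
import java.nio.ByteBuffer
import java.util.zip.GZIPOutputStream
import scala.util.Random

// Serialize Ints to raw bytes (4 per value) and return the gzip-compressed size in bytes.
def gzipSize(values: Array[Int]): Int = {
  val raw = ByteBuffer.allocate(values.length * 4)
  values.foreach(v => raw.putInt(v))
  val out = new ByteArrayOutputStream()
  val gz = new GZIPOutputStream(out)
  gz.write(raw.array())
  gz.close()
  out.size()
}

val rnd = new Random(42)
val n = 100000
val fewDistinct = Array.fill(n)(rnd.nextInt(10)) // only 10 distinct values
val manyDistinct = Array.fill(n)(rnd.nextInt())  // almost every value is different

val fewSize = gzipSize(fewDistinct)
val manySize = gzipSize(manyDistinct)
println(s"few distinct: $fewSize bytes, many distinct: $manySize bytes")
```

Both arrays hold the same number of values of the same type, yet the low-variety one compresses far better. The same reasoning is why a filtered output can take more space than expected when the remaining values are more varied.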

- better visualization of results with pivots
- more comments

# Explanation
There is difference in type when reading the data and type when writing it, causing a loss of compression perfomance.

Could you please add the "# What to aim for concretely" section, as in the other snippets? E.g. choose the right data type according to your needs, taking performance into account. You could reference the example you gave below.
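
To illustrate the trade-off that section could mention, here is a plain Scala sketch (no Spark needed). It shows the raw width of the JVM primitives behind common numeric types and one case where a narrower type changes the value. The names are hypothetical, and on-disk Parquet size after encoding and compression will differ from these raw widths.

```scala
// Raw storage width per value for the JVM primitives behind common numeric types.
val bytesPerValue: Seq[(String, Int)] = Seq(
  "Int" -> java.lang.Integer.BYTES,
  "Long" -> java.lang.Long.BYTES,
  "Float" -> java.lang.Float.BYTES,
  "Double" -> java.lang.Double.BYTES
)
bytesPerValue.foreach { case (t, b) => println(s"$t\t$b bytes") }

// A narrower type is not free: Float has a 24-bit significand,
// so 2^24 + 1 cannot be represented exactly and comes back changed.
val original: Int = 16777217
val viaFloat: Int = original.toFloat.toInt
val viaDouble: Int = original.toDouble.toInt
println(s"original=$original viaFloat=$viaFloat viaDouble=$viaDouble")
```

So the choice is a balance: pick the narrowest type that still represents every value you need exactly, then verify the effect on file size and read time.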

@mauriciojost

Thanks for the changes!

@mauriciojost mauriciojost left a comment

Thanks for the changes!

3 participants