You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
This is because Hadoop/Spark systems can not distribute a job when data is loaded from GZip files. GZip is not a 'splittable' format. So for example in Spark, after loading a GZip file one has to repartition the RDD to split it line-by-line.
This is done automatically using the bzip2 format.
S3 is a common data source for Hadoop/Spark jobs (straightforward use case with AWS EMR) so having bzip2 support would be essential. Other data ingestion tools like Apache Flume supports bzip2 compression.
The text was updated successfully, but these errors were encountered:
This is because Hadoop/Spark systems can not distribute a job when data is loaded from GZip files. GZip is not a 'splittable' format. So for example in Spark, after loading a GZip file one has to repartition the RDD to split it line-by-line.
This is done automatically using the bzip2 format.
S3 is a common data source for Hadoop/Spark jobs (straightforward use case with AWS EMR) so having bzip2 support would be essential. Other data ingestion tools like Apache Flume supports bzip2 compression.
The text was updated successfully, but these errors were encountered: