
Conversation

@darkone23

Resolves SPARK-7743.

Trivial changes to versions and package names, as well as a fix for a small issue in `ParquetTableOperations.scala`:

```diff
-    val readContext = getReadSupport(configuration).init(
+    val readContext = ParquetInputFormat.getReadSupportInstance(configuration).init(
```

This is necessary because `ParquetInputFormat.getReadSupport` was made package-private in the latest release.

Thanks
-- Thomas Omans

@rxin
Contributor

rxin commented Jun 2, 2015

Jenkins, test this please.

@rxin
Contributor

rxin commented Jun 3, 2015

Jenkins, test this please.

@rxin
Contributor

rxin commented Jun 3, 2015

Do you mind updating the pull request title to say Parquet 1.7?

@SparkQA

SparkQA commented Jun 3, 2015

Test build #34037 has finished for PR 6597 at commit 9e6ca82.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@darkone23 darkone23 changed the title [SPARK-7743] [SQL] Upgrading parquet version from incubator 1.6.0rc3 [SPARK-7743] [SQL] Parquet 1.7 Jun 3, 2015
@darkone23
Author

Looks like I forgot to update the test code - will get to that ASAP.

@darkone23
Author

Whoops, looks like I just missed one package name replacement somehow.

I force pushed a fix and flattened the history. Ready to test again :)

@rxin
Contributor

rxin commented Jun 3, 2015

Jenkins, retest this please.

@SparkQA

SparkQA commented Jun 3, 2015

Test build #34052 has finished for PR 6597 at commit e80e09e.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@darkone23
Author

Hm, not sure where parquet.ParquetRelation in SQLContext is coming from, but it's not org.apache.parquet - I'll dig a little deeper and see what is up.

@darkone23
Author

Figured it out: that was a package-local reference. I had actually missed some string-based classloading that used the old package names.

Working on a fix now.

@darkone23
Author

Fixed the pyspark examples and the Class.forName classloading in the test suites.

I've been having trouble running these integration tests on my laptop, apologies for the inconvenience.

@rxin
Contributor

rxin commented Jun 3, 2015

Jenkins, retest this please.

@SparkQA

SparkQA commented Jun 3, 2015

Test build #34060 has finished for PR 6597 at commit c88533b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@darkone23
Author

Looks like I blindly replaced some other org.apache.spark.sql package-local references from parquet.ClassName to org.apache.parquet.ClassName - this was really confusing, but it should be fixed now. 🙏

@rxin
Contributor

rxin commented Jun 3, 2015

Jenkins, ok to test.

@SparkQA

SparkQA commented Jun 3, 2015

Test build #34067 has finished for PR 6597 at commit 2df0d1b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor

retest this please

@liancheng
Contributor

@saucam also opened PR #5889 for upgrading Parquet, but 1.7.0 hadn't been released then, so he was upgrading to 1.6.0. The better part of #5889, however, is that it also cleans up some code we added to work around Parquet bugs. My suggestion is that we first try to get this PR in, since it upgrades to 1.7.0; the workaround cleanup is worth a separate PR. Then I can work with @saucam to bring #5889 up to date.

@saucam What do you think?

@saucam

saucam commented Jun 3, 2015

hey @liancheng , sounds ok to me. We can rebase once these changes are merged.

@SparkQA

SparkQA commented Jun 3, 2015

Test build #34078 has finished for PR 6597 at commit 2df0d1b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@darkone23
Author

Those last two failures appear somewhat consistent, but jenkins isn't much help in showing what they are:

[error] Got a return code of 143 on line 211 of the run-tests script.

Looking at https://github.com/apache/spark/blob/master/dev/run-tests#L211 it appears to be failing on the grep clause:

    echo -e "q\n" \
      | build/sbt $SBT_MAVEN_PROFILES_ARGS "${SBT_MAVEN_TEST_ARGS[@]}" \
      | grep -v -e "info.*Resolving" -e "warn.*Merging" -e "info.*Including"

And yet no test failures are reported through surefire... Why would it be checking for those strings in the output?

I'll try to pin down what is failing sometime today, any help is appreciated :)

edit: it looks like it is probably sbt that is exiting with 143 ... but no message as to why? is this a jenkins infrastructure error or an error in the parquet upgrade?
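For what it's worth, an exit status of 143 follows the Unix shell convention of 128 + signal number, and SIGTERM is signal 15, so 143 means the process was terminated rather than failing on its own. A minimal Java sketch of the convention (assuming a Unix system with a `sleep` binary; this is an illustration, not part of the build scripts):

```java
import java.util.concurrent.TimeUnit;

public class ExitCodeDemo {
    public static void main(String[] args) throws Exception {
        // Start a long-running child process, then terminate it with SIGTERM.
        Process p = new ProcessBuilder("sleep", "30").start();
        p.destroy(); // on Unix this delivers SIGTERM (signal 15)
        p.waitFor(5, TimeUnit.SECONDS);
        // Exit status for a signal death is reported as 128 + signal number.
        System.out.println(p.exitValue());
    }
}
```

On Linux this prints 143, matching the code Jenkins reported.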

@srowen
Member

srowen commented Jun 3, 2015

@vanzin did you say (offline) that you had some concerns about whether this ended up breaking something for user apps? or did you + Hari decide it wasn't a problem?

@vanzin
Contributor

vanzin commented Jun 3, 2015

I did (and still do) have concerns, because upgrading a dependency changes the classpath user applications see. And now, suddenly, all the old parquet.* packages are going to disappear, which may affect existing user applications.

I don't know enough about Spark's use of parquet to know whether any parquet types end up leaking through the API somehow, though. If that case doesn't exist, then this change would be a little more palatable. But, generally, upgrading dependencies like this is a little bit sketchy from a user's perspective.

@darkone23
Author

@vanzin - if people are depending on parquet transitively through spark, then they simply need to make their dependency explicit. But yes - Spark's public-facing signatures that expose parquet types will require downstream user code to be updated in order to compile.

This is inevitable unless Spark intends to stay on an incubating release-candidate version of parquet until some major version change like 2.0 - and is willing to accommodate the parquet bugs that come along with that.

Please be advised of the associated risk, but I still believe upgrading parquet is a good idea.

@rxin
Contributor

rxin commented Jun 3, 2015

We don't expose Parquet in the API. If other projects depend on Parquet, the appropriate thing for them to do is to declare it explicitly, which in this case would be fine.

The current Parquet version is way too buggy (e.g. no filter pushdown). We definitely need to upgrade.

@pwendell
Contributor

pwendell commented Jun 4, 2015

@rxin user apps aren't isolated though, so upgrading dependencies isn't free. But I agree we should upgrade this one; it's buggy, and a pretty old version of a fast-moving library.

We could also shade parquet down the road. In my experience, though, we tend to have way more issues with mature and super widely used libraries like jetty.

For young libraries, where consumers are constantly upgrading anyway because they are buggy, these haven't historically been the ones causing users the most trouble.

@rxin
Contributor

rxin commented Jun 4, 2015

It's a different package name, so users can even use older versions of Parquet.

@pwendell
Contributor

pwendell commented Jun 4, 2015

Oh I see- then it really doesn't interfere. Users can just add the existing one.

@liancheng
Contributor

We definitely want to upgrade this one, since it has some major bugs that render filter push-down completely unusable, which has a noticeable negative impact on performance.

@eggsby As for the build failure, return code 143 means the process got killed by a SIGTERM. I guess it's because the build timed out. Let's just have another try first.

@liancheng
Contributor

retest this please

@liancheng
Contributor

@eggsby I meant, I don't think the build failure is caused by bumping Parquet version.

@SparkQA

SparkQA commented Jun 4, 2015

Test build #34176 has finished for PR 6597 at commit 2df0d1b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 4, 2015

Test build #34177 has finished for PR 6597 at commit 2df0d1b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Jun 4, 2015

Thanks. I'm going to merge this into master.

@oliviertoupin

Too late for this to make it into Spark 1.4? Seems it would fix a lot of small things.

@darkone23
Author

Thanks all.

🎈 🌟 🎉

@rxin
Contributor

rxin commented Jun 4, 2015

Sorry, this won't make it into 1.4. Dependency bumps are risky in general, and we have already cut release candidates for 1.4.

You can, however, do a special build yourself with this change.

@JoshRosen
Contributor

Looking through some of the Jenkins pull request builder logs, I've noticed some noisier log output from Parquet:

Jun 4, 2015 3:04:45 PM INFO: org.apache.parquet.hadoop.ParquetFileReader: Initiating action with parallelism: 5
Jun 4, 2015 3:04:45 PM INFO: org.apache.parquet.hadoop.ParquetFileReader: reading summary file: file:/tmp/spark-a698f45b-c33a-4601-abdc-ca878a2fa499/test_insert_parquet/_common_metadata
Jun 4, 2015 3:04:45 PM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
Jun 4, 2015 3:04:45 PM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
Jun 4, 2015 3:04:45 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 5 records.
Jun 4, 2015 3:04:45 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block
Jun 4, 2015 3:04:45 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 0 ms. row count = 5
Jun 4, 2015 3:04:45 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 5 records.
Jun 4, 2015 3:04:45 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block
Jun 4, 2015 3:04:45 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 1 ms. row count = 5
Jun 4, 2015 3:04:45 PM INFO: org.apache.parquet.hadoop.ParquetFileReader: Initiating action with parallelism: 5
Jun 4, 2015 3:04:45 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: Compression: GZIP
Jun 4, 2015 3:04:45 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Parquet block size to 134217728
Jun 4, 2015 3:04:45 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Parquet page size to 1048576
Jun 4, 2015 3:04:45 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Parquet dictionary page size to 1048576

Can someone file a JIRA and investigate?

@vanzin
Contributor

vanzin commented Jun 4, 2015

Seems like the log4j.properties for sql/ needs to be updated to match the new package names.

@liancheng
Contributor

@JoshRosen @vanzin Filed SPARK-8118 for this. Adjusting ParquetRelation.enableLogForwarding() should fix this.

@kostya-sh

BTW enableLogForwarding() should be updated as well, because the name of the logger has changed to org.apache.parquet. From parquet-mr's Log:

    // add a default handler in case there is none
    Logger logger = Logger.getLogger(Log.class.getPackage().getName());

Another problem with enableLogForwarding() is that it doesn't keep a reference to the loggers it creates, so they can be garbage collected and all configuration changes will be lost. From the https://docs.oracle.com/javase/6/docs/api/java/util/logging/Logger.html javadocs: "It is important to note that the Logger returned by one of the getLogger factory methods may be garbage collected at any time if a strong reference to the Logger is not kept."

I've created https://issues.apache.org/jira/browse/SPARK-8122
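To illustrate the garbage-collection pitfall, here is a minimal Java sketch of the usual fix (the class and method names are hypothetical, not Spark's actual code): keep a strong static reference to the configured logger so java.util.logging cannot collect it along with its configuration.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class ParquetLogConfig {
    // java.util.logging's LogManager only holds loggers weakly, so a logger
    // nobody references strongly can be garbage collected together with its
    // level/handler configuration. A static field prevents that.
    private static final Logger PARQUET_LOGGER =
        Logger.getLogger("org.apache.parquet");

    public static void configure() {
        // Quiet the noisy INFO output seen in the Jenkins logs.
        PARQUET_LOGGER.setLevel(Level.WARNING);
    }
}
```

Because the field pins the logger, the WARNING level survives even after later garbage-collection cycles.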

jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015
Resolves [SPARK-7743](https://issues.apache.org/jira/browse/SPARK-7743).

Trivial changes of versions, package names, as well as a small issue in `ParquetTableOperations.scala`

```diff
-    val readContext = getReadSupport(configuration).init(
+    val readContext = ParquetInputFormat.getReadSupportInstance(configuration).init(
```

Since ParquetInputFormat.getReadSupport was made package private in the latest release.

Thanks
-- Thomas Omans

Author: Thomas Omans <tomans@cj.com>

Closes apache#6597 from eggsby/SPARK-7743 and squashes the following commits:

2df0d1b [Thomas Omans] [SPARK-7743] [SQL] Upgrading parquet version to 1.7.0
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015
@liancheng liancheng deleted the SPARK-7743 branch July 14, 2015 00:20
mingyukim pushed a commit to palantir/spark that referenced this pull request Aug 17, 2015

Conflicts:
	sql/core/src/test/scala/org/apache/spark/sql/parquet/ParquetIOSuite.scala