-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-7743] [SQL] Parquet 1.7 #6597
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Jenkins, test this please. |
1 similar comment
|
Jenkins, test this please. |
|
Do you mind updating the pull request title to say Parquet 1.7? |
|
Test build #34037 has finished for PR 6597 at commit
|
|
Looks like I forgot to update the test code - will get to that ASAP. |
|
Whoops, looks like I just missed one package name replacement somehow. I force pushed a fix and flattened the history. Ready to test again :) |
|
Jenkins, retest this please. |
|
Test build #34052 has finished for PR 6597 at commit
|
|
Hm, not sure where |
|
Figured it out, that was a package local reference - I actually missed some string based classloading using the old package names. Working on a fix now. |
|
Fixed the pyspark examples and the Class.forName classloading in the test suites. I've been having trouble running these integration tests on my laptop, apologies for the inconvenience. |
|
Jenkins, retest this please. |
|
Test build #34060 has finished for PR 6597 at commit
|
|
Looks like I blindly replaced some other |
|
Jenkins, ok to test. |
|
Test build #34067 has finished for PR 6597 at commit
|
|
retest this please |
|
@saucam also opened PR #5889 for upgrading Parquet, but 1.7.0 hadn't been released then, so he was upgrading to 1.6.0. However, the better part of #5889 is that, it also cleans up some code we added to workaround some Parquet bugs. My suggestion is, we may first try to make this PR in since it upgrades to 1.7.0, and all those renaming stuff worth a separate PR. Then I can work with @saucam to bring #5889 up to date. @saucam How do you think? |
|
hey @liancheng , sounds ok to me. We can rebase once these changes are merged. |
|
Test build #34078 has finished for PR 6597 at commit
|
|
Those last two failures appear somewhat consistent, but jenkins isn't much help in showing what they are: Looking at https://github.com/apache/spark/blob/master/dev/run-tests#L211 it appears to be failing on the grep clause: echo -e "q\n" \
| build/sbt $SBT_MAVEN_PROFILES_ARGS "${SBT_MAVEN_TEST_ARGS[@]}" \
| grep -v -e "info.*Resolving" -e "warn.*Merging" -e "info.*Including"And yet no test failures reported through surefire.... Why would it be ensuring that those strings are in the output? I'll try to pin down what is failing sometime today, any help is appreciated :) edit: it looks like it is probably sbt that is exiting with |
|
@vanzin did you say (offline) that you had some concerns about whether this ended up breaking something for user apps? or did you + Hari decide it wasn't a problem? |
|
I did (and still do) have concerns, because upgrading a dependency changes the classpath user applications see. And now, suddenly, all the old I don't know enough about Spark's use of parquet to know whether any parquet types end up leaking through the API somehow, though. If that case doesn't exist, then this change would be a little more palatable. But, generally, upgrading dependencies like this is a little bit sketchy from a user's perspective. |
|
@vanzin - if people are depending on parquet transitively through spark then they simply need to make their dependency explicit, but yes - sparks public facing signatures that expose parquet types will need to be updated downstream in user code in order to compile. This is inevitable unless spark intends to stay on an incubating release candidate version of parquet until some major version change like 2.0 - and is willing to accommodate the parquet bugs that come along with that. Please be advised of the associated risk, but I still believe upgrading parquet is a good idea. |
|
We don't expose Parquet in the API. If other projects depended on Parquet, the appropriate thing for them to do is to declare explicitly, which in this case would be fine. The current Parquet version is way to buggy (i.e. no filter pushdown). We definitely need to upgrade. |
|
@rxin user apps aren't isolated though, so it isn't free to upgrade dependencies. But I agree we should upgrade this one, it's buggy and a pretty old version of a fast moving library. We could also shade parquet down the road. In my experience though we tend to have way more issues with mature and super widely used libraries like jetty. For young libraries where consumers are anyways constantly upgrading because they are buggy, these haven't in the past been the ones causing users the most trouble. |
|
It's a different package name, so users can even use older versions of Parquet. |
|
Oh I see- then it really doesn't interfere. Users can just add the existing one. |
|
We definitely want to upgrade this one, since it has some major bugs that causes filter push-down completely unusable, which causes noticeable negative impacts on performance. @eggsby As for the build failure, return code 143 means the process got killed by a |
|
retest this please |
|
@eggsby I meant, I don't think the build failure is caused by bumping Parquet version. |
|
Test build #34176 has finished for PR 6597 at commit
|
|
Test build #34177 has finished for PR 6597 at commit
|
|
Thanks. I'm going to merge this into master. |
|
To late for this to make it in Spark 1.4? Seems it would fix a lot of small things. |
|
Thanks all. 🎈 🌟 🎉 |
|
Sorry this won't make it into 1.4. Dependency bumps are risky in general and we have already cut release candidates for 1.4. You can however do a special build yourself with this though. |
|
Looking through some of the Jenkins pull request builder logs, I've noticed some noisier log output from Parquet: Can someone file a JIRA and investigate? |
|
Seems like the log4j.properties for sql/ needs to be updated to match the new package names. |
|
@JoshRosen @vanzin Filed SPARK-8118 for this. Adjusting |
|
BTW Another problem with I've created https://issues.apache.org/jira/browse/SPARK-8122 |
Resolves [SPARK-7743](https://issues.apache.org/jira/browse/SPARK-7743). Trivial changes of versions, package names, as well as a small issue in `ParquetTableOperations.scala` ```diff - val readContext = getReadSupport(configuration).init( + val readContext = ParquetInputFormat.getReadSupportInstance(configuration).init( ``` Since ParquetInputFormat.getReadSupport was made package private in the latest release. Thanks -- Thomas Omans Author: Thomas Omans <tomans@cj.com> Closes apache#6597 from eggsby/SPARK-7743 and squashes the following commits: 2df0d1b [Thomas Omans] [SPARK-7743] [SQL] Upgrading parquet version to 1.7.0
Resolves [SPARK-7743](https://issues.apache.org/jira/browse/SPARK-7743). Trivial changes of versions, package names, as well as a small issue in `ParquetTableOperations.scala` ```diff - val readContext = getReadSupport(configuration).init( + val readContext = ParquetInputFormat.getReadSupportInstance(configuration).init( ``` Since ParquetInputFormat.getReadSupport was made package private in the latest release. Thanks -- Thomas Omans Author: Thomas Omans <tomans@cj.com> Closes apache#6597 from eggsby/SPARK-7743 and squashes the following commits: 2df0d1b [Thomas Omans] [SPARK-7743] [SQL] Upgrading parquet version to 1.7.0
Resolves [SPARK-7743](https://issues.apache.org/jira/browse/SPARK-7743). Trivial changes of versions, package names, as well as a small issue in `ParquetTableOperations.scala` ```diff - val readContext = getReadSupport(configuration).init( + val readContext = ParquetInputFormat.getReadSupportInstance(configuration).init( ``` Since ParquetInputFormat.getReadSupport was made package private in the latest release. Thanks -- Thomas Omans Author: Thomas Omans <tomans@cj.com> Closes apache#6597 from eggsby/SPARK-7743 and squashes the following commits: 2df0d1b [Thomas Omans] [SPARK-7743] [SQL] Upgrading parquet version to 1.7.0 Conflicts: sql/core/src/test/scala/org/apache/spark/sql/parquet/ParquetIOSuite.scala
Resolves SPARK-7743.
Trivial changes of versions, package names, as well as a small issue in
ParquetTableOperations.scalaSince ParquetInputFormat.getReadSupport was made package private in the latest release.
Thanks
-- Thomas Omans