
Conversation

@darkone23

Resolves SPARK-7743.

Trivial changes to versions and package names, as well as a fix for a small issue in `ParquetTableOperations.scala`:

```diff
-    val readContext = getReadSupport(configuration).init(
+    val readContext = ParquetInputFormat.getReadSupportInstance(configuration).init(
```

This is necessary because `ParquetInputFormat.getReadSupport` was made package-private in the latest release.

Thanks
-- Thomas Omans

@rxin
Contributor

rxin commented Jun 2, 2015

Jenkins, test this please.

@rxin
Contributor

rxin commented Jun 3, 2015

Jenkins, test this please.

@rxin
Contributor

rxin commented Jun 3, 2015

Do you mind updating the pull request title to say Parquet 1.7?

@SparkQA

SparkQA commented Jun 3, 2015

Test build #34037 has finished for PR 6597 at commit 9e6ca82.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@darkone23 darkone23 changed the title [SPARK-7743] [SQL] Upgrading parquet version from incubator 1.6.0rc3 [SPARK-7743] [SQL] Parquet 1.7 Jun 3, 2015
@darkone23
Author

Looks like I forgot to update the test code - will get to that ASAP.

@darkone23
Author

Whoops, looks like I just missed one package name replacement somehow.

I force pushed a fix and flattened the history. Ready to test again :)

@rxin
Contributor

rxin commented Jun 3, 2015

Jenkins, retest this please.

@SparkQA

SparkQA commented Jun 3, 2015

Test build #34052 has finished for PR 6597 at commit e80e09e.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@darkone23
Author

Hm, not sure where parquet.ParquetRelation in SQLContext is coming from, but it's not org.apache.parquet - I'll dig a little deeper and see what is up.

@darkone23
Author

Figured it out: that was a package-local reference. I had actually missed some string-based classloading that used the old package names.

Working on a fix now.

@darkone23
Author

Fixed the pyspark examples and the Class.forName classloading in the test suites.

I've been having trouble running these integration tests on my laptop, apologies for the inconvenience.

@rxin
Contributor

rxin commented Jun 3, 2015

Jenkins, retest this please.

@SparkQA

SparkQA commented Jun 3, 2015

Test build #34060 has finished for PR 6597 at commit c88533b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@darkone23
Author

Looks like I blindly replaced some other org.apache.spark.sql package-local references from parquet.ClassName to org.apache.parquet.ClassName - this was really confusing, but it should be fixed now. 🙏

@rxin
Contributor

rxin commented Jun 3, 2015

Jenkins, ok to test.

@SparkQA

SparkQA commented Jun 3, 2015

Test build #34067 has finished for PR 6597 at commit 2df0d1b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor

retest this please

@liancheng
Contributor

@saucam also opened PR #5889 for upgrading Parquet, but 1.7.0 hadn't been released then, so he was upgrading to 1.6.0. The better part of #5889, however, is that it also cleans up some code we added to work around Parquet bugs. My suggestion is that we first try to get this PR in, since it upgrades to 1.7.0; the workaround cleanup is worth a separate PR. Then I can work with @saucam to bring #5889 up to date.

@saucam What do you think?

@saucam

saucam commented Jun 3, 2015

hey @liancheng , sounds ok to me. We can rebase once these changes are merged.

@SparkQA

SparkQA commented Jun 3, 2015

Test build #34078 has finished for PR 6597 at commit 2df0d1b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@darkone23
Author

Those last two failures appear somewhat consistent, but jenkins isn't much help in showing what they are:

[error] Got a return code of 143 on line 211 of the run-tests script.

Looking at https://github.com/apache/spark/blob/master/dev/run-tests#L211 it appears to be failing on the grep clause:

    echo -e "q\n" \
      | build/sbt $SBT_MAVEN_PROFILES_ARGS "${SBT_MAVEN_TEST_ARGS[@]}" \
      | grep -v -e "info.*Resolving" -e "warn.*Merging" -e "info.*Including"

And yet no test failures are reported through surefire... Why would it be checking for those strings in the output?

I'll try to pin down what is failing sometime today, any help is appreciated :)

edit: it looks like it is probably sbt that is exiting with 143 ... but no message as to why? is this a jenkins infrastructure error or an error in the parquet upgrade?
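For what it's worth, an exit status of 143 follows the Unix shell convention of 128 + signal number, and SIGTERM is signal 15, so 143 means the process was terminated rather than failing on its own. A minimal Java sketch of the convention (assuming a Unix system with a `sleep` binary; this is an illustration, not part of the build scripts):

```java
import java.util.concurrent.TimeUnit;

public class ExitCodeDemo {
    public static void main(String[] args) throws Exception {
        // Start a long-running child process, then terminate it with SIGTERM.
        Process p = new ProcessBuilder("sleep", "30").start();
        p.destroy(); // on Unix this delivers SIGTERM (signal 15)
        p.waitFor(5, TimeUnit.SECONDS);
        // Exit status for a signal death is reported as 128 + signal number.
        System.out.println(p.exitValue());
    }
}
```

On Linux this prints 143, matching the code Jenkins reported.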

@srowen
Member

srowen commented Jun 3, 2015

@vanzin did you say (offline) that you had some concerns about whether this ended up breaking something for user apps? or did you + Hari decide it wasn't a problem?

@vanzin
Contributor

vanzin commented Jun 3, 2015

I did (and still do) have concerns, because upgrading a dependency changes the classpath user applications see. And now, suddenly, all the old parquet.* packages are going to disappear, which may affect existing user applications.

I don't know enough about Spark's use of parquet to know whether any parquet types end up leaking through the API somehow, though. If that case doesn't exist, then this change would be a little more palatable. But, generally, upgrading dependencies like this is a little bit sketchy from a user's perspective.

@darkone23
Author

@vanzin - if people are depending on parquet transitively through spark, then they simply need to make their dependency explicit. But yes - Spark's public-facing signatures that expose parquet types will require downstream user code to be updated in order to compile.

This is inevitable unless Spark intends to stay on an incubating release-candidate version of parquet until some major version change like 2.0 - and is willing to accommodate the parquet bugs that come along with that.

Please be advised of the associated risk, but I still believe upgrading parquet is a good idea.

@rxin
Contributor

rxin commented Jun 3, 2015

We don't expose Parquet in the API. If other projects depend on Parquet, the appropriate thing for them to do is to declare it explicitly, which in this case would be fine.

The current Parquet version is way too buggy (e.g. no filter pushdown). We definitely need to upgrade.

@pwendell
Contributor

pwendell commented Jun 4, 2015

@rxin user apps aren't isolated though, so upgrading dependencies isn't free. But I agree we should upgrade this one; it's buggy, and a pretty old version of a fast-moving library.

We could also shade parquet down the road. In my experience, though, we tend to have way more issues with mature and super widely used libraries like jetty.

For young libraries, where consumers are constantly upgrading anyway because they are buggy, these haven't historically been the ones causing users the most trouble.

@rxin
Contributor

rxin commented Jun 4, 2015

It's a different package name, so users can even use older versions of Parquet.

@pwendell
Contributor

pwendell commented Jun 4, 2015

Oh I see- then it really doesn't interfere. Users can just add the existing one.

@liancheng
Contributor

We definitely want to upgrade this one, since it has some major bugs that render filter push-down completely unusable, which has a noticeable negative impact on performance.

@eggsby As for the build failure, return code 143 means the process got killed by a SIGTERM. I guess it's because the build timed out. Let's just have another try first.

@liancheng
Contributor

retest this please

@liancheng
Contributor

@eggsby I meant, I don't think the build failure is caused by bumping Parquet version.

@SparkQA

SparkQA commented Jun 4, 2015

Test build #34176 has finished for PR 6597 at commit 2df0d1b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 4, 2015

Test build #34177 has finished for PR 6597 at commit 2df0d1b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Jun 4, 2015

Thanks. I'm going to merge this into master.

@oliviertoupin

Too late for this to make it into Spark 1.4? Seems it would fix a lot of small things.

@darkone23
Author

Thanks all.

🎈 🌟 🎉

@rxin
Contributor

rxin commented Jun 4, 2015

Sorry, this won't make it into 1.4. Dependency bumps are risky in general, and we have already cut release candidates for 1.4.

You can, however, do a special build yourself with this change.

@JoshRosen
Contributor

Looking through some of the Jenkins pull request builder logs, I've noticed some noisier log output from Parquet:

Jun 4, 2015 3:04:45 PM INFO: org.apache.parquet.hadoop.ParquetFileReader: Initiating action with parallelism: 5
Jun 4, 2015 3:04:45 PM INFO: org.apache.parquet.hadoop.ParquetFileReader: reading summary file: file:/tmp/spark-a698f45b-c33a-4601-abdc-ca878a2fa499/test_insert_parquet/_common_metadata
Jun 4, 2015 3:04:45 PM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
Jun 4, 2015 3:04:45 PM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
Jun 4, 2015 3:04:45 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 5 records.
Jun 4, 2015 3:04:45 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block
Jun 4, 2015 3:04:45 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 0 ms. row count = 5
Jun 4, 2015 3:04:45 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 5 records.
Jun 4, 2015 3:04:45 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block
Jun 4, 2015 3:04:45 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 1 ms. row count = 5
Jun 4, 2015 3:04:45 PM INFO: org.apache.parquet.hadoop.ParquetFileReader: Initiating action with parallelism: 5
Jun 4, 2015 3:04:45 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: Compression: GZIP
Jun 4, 2015 3:04:45 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Parquet block size to 134217728
Jun 4, 2015 3:04:45 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Parquet page size to 1048576
Jun 4, 2015 3:04:45 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Parquet dictionary page size to 1048576

Can someone file a JIRA and investigate?

@vanzin
Contributor

vanzin commented Jun 4, 2015

Seems like the log4j.properties for sql/ needs to be updated to match the new package names.

@liancheng
Contributor

@JoshRosen @vanzin Filed SPARK-8118 for this. Adjusting ParquetRelation.enableLogForwarding() should fix this.

@kostya-sh

BTW enableLogForwarding() should be updated as well, because the name of the logger has changed to org.apache.parquet. From parquet-mr's Log:

    // add a default handler in case there is none
    Logger logger = Logger.getLogger(Log.class.getPackage().getName());

Another problem with enableLogForwarding() is that it doesn't keep a reference to the loggers it creates, so they can be garbage collected and all configuration changes will be lost. From the https://docs.oracle.com/javase/6/docs/api/java/util/logging/Logger.html javadocs: "It is important to note that the Logger returned by one of the getLogger factory methods may be garbage collected at any time if a strong reference to the Logger is not kept."

I've created https://issues.apache.org/jira/browse/SPARK-8122
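To illustrate the garbage-collection pitfall, here is a minimal Java sketch of the usual fix (the class and method names are hypothetical, not Spark's actual code): keep a strong static reference to the configured logger so java.util.logging cannot collect it along with its configuration.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class ParquetLogConfig {
    // java.util.logging's LogManager only holds loggers weakly, so a logger
    // nobody references strongly can be garbage collected together with its
    // level/handler configuration. A static field prevents that.
    private static final Logger PARQUET_LOGGER =
        Logger.getLogger("org.apache.parquet");

    public static void configure() {
        // Quiet the noisy INFO output seen in the Jenkins logs.
        PARQUET_LOGGER.setLevel(Level.WARNING);
    }
}
```

Because the field pins the logger, the WARNING level survives even after later garbage-collection cycles.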

jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015
Resolves [SPARK-7743](https://issues.apache.org/jira/browse/SPARK-7743).

Trivial changes of versions, package names, as well as a small issue in `ParquetTableOperations.scala`

```diff
-    val readContext = getReadSupport(configuration).init(
+    val readContext = ParquetInputFormat.getReadSupportInstance(configuration).init(
```

Since ParquetInputFormat.getReadSupport was made package private in the latest release.

Thanks
-- Thomas Omans

Author: Thomas Omans <tomans@cj.com>

Closes apache#6597 from eggsby/SPARK-7743 and squashes the following commits:

2df0d1b [Thomas Omans] [SPARK-7743] [SQL] Upgrading parquet version to 1.7.0
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015
@liancheng liancheng deleted the SPARK-7743 branch July 14, 2015 00:20
mingyukim pushed a commit to palantir/spark that referenced this pull request Aug 17, 2015

Conflicts:
	sql/core/src/test/scala/org/apache/spark/sql/parquet/ParquetIOSuite.scala