
Conversation

viirya (Member) commented May 8, 2015

JIRA: https://issues.apache.org/jira/browse/SPARK-7447

`MetadataCache` in `ParquetRelation2` is annotated as `@transient`. When `ParquetRelation2` is deserialized, we ask `MetadataCache` to refresh, which performs schema merging again. This is time-consuming, especially when there are many Parquet files.

With the new `FSBasedParquetRelation`, although `MetadataCache` is no longer `@transient`, `MetadataCache.refresh()` still performs schema merging again when the relation is deserialized.
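
To make the failure mode concrete, here is a minimal, self-contained sketch of the pattern in plain Scala. It is not Spark's actual `newParquet.scala` code: `MergedSchema`, `SchemaUtil.readAndMerge`, and both relation classes below are hypothetical stand-ins. The first class shows why a `@transient` cache forces every deserialized copy to re-merge; the second shows the general idea of carrying the already-merged schema in a serializable field so the expensive merge happens only once.

```scala
// Hypothetical stand-ins for per-file Parquet footers and a merged schema; not Spark's real types.
final case class MergedSchema(fields: Set[String])

object SchemaUtil {
  // Placeholder for the expensive part: reading every footer and merging the per-file schemas.
  def readAndMerge(paths: Seq[String]): MergedSchema = {
    println(s"merging schemas of ${paths.size} files (expensive)")
    MergedSchema(paths.toSet)
  }
}

// Problematic pattern: the cache is @transient, so every deserialized copy of the
// relation starts with an empty cache and pays the full merge cost again.
class ParquetRelationBefore(val paths: Seq[String]) extends Serializable {
  @transient private var cachedSchema: Option[MergedSchema] = None

  def schema: MergedSchema = {
    // After Java deserialization the @transient field is null, so we always re-merge.
    if (cachedSchema == null || cachedSchema.isEmpty) {
      cachedSchema = Some(SchemaUtil.readAndMerge(paths))
    }
    cachedSchema.get
  }
}

// Sketch of the fix described above: keep the merged schema in a serializable field,
// so a deserialized copy reuses it instead of merging again.
class ParquetRelationAfter(val paths: Seq[String],
                           private var mergedSchema: Option[MergedSchema] = None)
  extends Serializable {

  def schema: MergedSchema = mergedSchema match {
    case Some(s) => s                        // reuse the schema shipped with the relation
    case None =>
      val s = SchemaUtil.readAndMerge(paths) // merge only once, where the relation is built
      mergedSchema = Some(s)
      s
  }
}
```

The actual change in this PR follows the same intent but operates on Spark's real metadata cache in `newParquet.scala`; the sketch only illustrates the serialization behavior, not the specific implementation.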

SparkQA commented May 8, 2015

Test build #32246 has finished for PR 6012 at commit b0fc09b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

scwf (Contributor) commented May 9, 2015

good catch

marmbrus (Contributor) commented May 9, 2015

I think this codepath is going to be deleted, but we should make sure that #5526 does not suffer from the same issue.

/cc @liancheng

viirya (Member, Author) commented May 10, 2015

@marmbrus After checking #5526, I think this part of the code is still there and in use, so this PR may still be needed.

marmbrus (Contributor) commented

Yes, but look at the linked PR that replaces the parquet support with this new API.

viirya (Member, Author) commented May 10, 2015

OK. I see. You meant liancheng#6.

marmbrus (Contributor) commented

Yep!

Merge remote-tracking branch 'upstream/master' into without_remerge_schema

Conflicts:
	sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala

viirya (Member, Author) commented May 14, 2015

@marmbrus #6090 was merged, but this problem is still there. I updated the code. Please take a look. Thanks.

marmbrus (Contributor) commented

LGTM

/cc @liancheng

SparkQA commented May 14, 2015

Test build #32691 has finished for PR 6012 at commit 6ac7d93.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

liancheng (Contributor) commented

@viirya Sorry for the late reply, I missed this PR while working on #6090. This fix LGTM. Could you please rebase this PR? Note that `fsBasedParquet.scala` was renamed back to `newParquet.scala`, since the old `ParquetRelation2` implementation was finally removed.

Merge remote-tracking branch 'upstream/master' into without_remerge_schema

Conflicts:
	sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala

viirya (Member, Author) commented May 17, 2015

@liancheng updated. Please take a look.

SparkQA commented May 17, 2015

Test build #32927 has finished for PR 6012 at commit 2663957.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

liancheng (Contributor) commented

Thanks for fixing this! Merging to master and branch-1.4.

asfgit pushed a commit that referenced this pull request May 17, 2015
… deserialized

JIRA: https://issues.apache.org/jira/browse/SPARK-7447

`MetadataCache` in `ParquetRelation2` is annotated as `transient`. When `ParquetRelation2` is deserialized, we ask `MetadataCache` to refresh and perform schema merging again. It is time-consuming especially for very many parquet files.

With the new `FSBasedParquetRelation`, although `MetadataCache` is not `transient` now, `MetadataCache.refresh()` still performs schema merging again when the relation is deserialized.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #6012 from viirya/without_remerge_schema and squashes the following commits:

2663957 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into without_remerge_schema
6ac7d93 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into without_remerge_schema
b0fc09b [Liang-Chi Hsieh] Don't generate and merge parquetSchema multiple times.

(cherry picked from commit 3399055)
Signed-off-by: Cheng Lian <lian@databricks.com>
@asfgit asfgit closed this in 3399055 May 17, 2015
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 28, 2015
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015
@viirya viirya deleted the without_remerge_schema branch December 27, 2023 18:31