[SPARK-33449][SQL] Support File Metadata Cache for Parquet #30483
Conversation
@wangyum WIP now, missing some configuration entries, test suites, and ORC file support.

Test build #131655 has finished for PR 30483 at commit
dongjoon-hyun left a comment
Since your fix is merged here, could you fix the Scala style and rebase onto master, @LuciferYang?
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
Outdated
OK, will do it later.

Test build #131754 has finished for PR 30483 at commit

@LuciferYang It would be great if we had some benchmark numbers.

@wangyum This is a very good suggestion.
...java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java
Outdated
Test build #131760 has finished for PR 30483 at commit
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileMeta.scala
Outdated
Test build #131771 has finished for PR 30483 at commit

Test build #131775 has finished for PR 30483 at commit

Test build #142494 has finished for PR 30483 at commit
@dongjoon-hyun Because #33748 provides an ORC-only PR under a new JIRA (SPARK-36516), I'll change this PR to be Parquet-only.

There will be some duplicate code between the two PRs, and that part of the code will be synchronized after one of them is merged.
Kubernetes integration test starting

Test build #142498 has finished for PR 30483 at commit
Got it. Thank you for moving this effort forward, @LuciferYang.
Kubernetes integration test status success

Kubernetes integration test unable to build dist. exiting with code: 1

Kubernetes integration test starting

Kubernetes integration test status success

Kubernetes integration test starting

Kubernetes integration test status failure

Test build #142500 has finished for PR 30483 at commit

Test build #142505 has finished for PR 30483 at commit
-  lazy val footerFileMetaData =
+  lazy val footerFileMetaData = if (parquetMetaCacheEnabled) {
+    ParquetFileMeta.readFooterFromCache(filePath, conf).getFileMetaData
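For context, a minimal sketch of how the complete branch might look. The non-cached fallback below is an assumption about the surrounding call site, not code from this PR, and it reuses the names (parquetMetaCacheEnabled, filePath, conf, ParquetFileMeta) that appear in the excerpt above:

    import org.apache.parquet.format.converter.ParquetMetadataConverter
    import org.apache.parquet.hadoop.ParquetFileReader

    lazy val footerFileMetaData = if (parquetMetaCacheEnabled) {
      // Serve the footer from the process-wide metadata cache, keyed by file path.
      ParquetFileMeta.readFooterFromCache(filePath, conf).getFileMetaData
    } else {
      // Assumed fallback: read only the file-level metadata, skipping row-group details.
      ParquetFileReader.readFooter(conf, filePath, ParquetMetadataConverter.SKIP_ROW_GROUPS)
        .getFileMetaData
    }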
What happens if the file is removed and replaced?

We can discuss it in #33748 first. I'll set this PR to draft for now.
sunchao left a comment
Thanks @LuciferYang for the PR. Correct me if I'm wrong, but I feel this is only really useful in a cluster where Spark executors are reused across different queries, and even in that case we need to be very careful about cache invalidation, since the same file can be overwritten with different content (e.g., in a Hive insert overwrite).
I noticed that Spark currently needs to read the footer twice: once in SpecificParquetRecordReaderBase and again in ParquetFileFormat or ParquetPartitionReaderFactory. This could be fixed separately with a much simpler approach.
    .createWithDefault(false)

  val FILE_META_CACHE_PARQUET_ENABLED = buildConf("spark.sql.fileMetaCache.parquet.enabled")
    .doc("To indicate if enable parquet file meta cache, it is recommended to enabled " +
Hmm, curious whether this can help if your Spark queries run as separate Spark jobs, where each of them may use different executors.

Yes, this feature does have limitations; NODE_LOCAL plus a Thrift server doing interactive analysis should be the best scenario. If the architecture separates storage and compute, we need to consider task scheduling.
In fact, in the OAP project, the fileMetaCache relies on the dataCache (PROCESS_LOCAL).
    .booleanConf
    .createWithDefault(false)

  val FILE_META_CACHE_TTL_SINCE_LAST_ACCESS =
nit: maybe FILE_META_CACHE_TTL_SINCE_LAST_ACCESS_SEC and spark.sql.fileMetaCache.ttlSinceLastAccessSec, so it's easier to know that the unit is seconds?

Good suggestion.
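For illustration, a sketch of what the renamed entry could look like using SQLConf's existing builder DSL (assuming it lives in SQLConf, where buildConf is in scope); the config name follows the suggestion above, while the default value and doc text are assumptions, not what the PR finally ships:

    import java.util.concurrent.TimeUnit

    val FILE_META_CACHE_TTL_SINCE_LAST_ACCESS_SEC =
      buildConf("spark.sql.fileMetaCache.ttlSinceLastAccessSec")
        .doc("Time-to-live in seconds for a file metadata cache entry after its last access.")
        .timeConf(TimeUnit.SECONDS)
        .checkValue(_ > 0, "The TTL must be positive.")
        .createWithDefault(3600L)

Encoding the unit both in the key name and via timeConf(TimeUnit.SECONDS) keeps the setting self-describing even when it is set from spark-defaults.conf.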
    this.fileSchema = footer.getFileMetaData().getSchema();
    FilterCompat.Filter filter = ParquetInputFormat.getFilter(configuration);
    List<BlockMetaData> blocks =
        RowGroupFilter.filterRowGroups(filter, footer.getBlocks(), fileSchema);
Does this apply all the filter levels? E.g., stats, dictionary, and bloom filter.

I need to investigate it again.
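For reference: if memory serves, the three-argument overload used above can only apply statistics-level pruning, while a newer parquet-mr overload takes an explicit list of filter levels plus a ParquetFileReader. A hedged Scala sketch of that variant; treat the exact signature and available levels as assumptions to verify during that investigation:

    import java.util.Arrays
    import org.apache.parquet.filter2.compat.{FilterCompat, RowGroupFilter}
    import org.apache.parquet.hadoop.ParquetFileReader
    import org.apache.parquet.hadoop.metadata.BlockMetaData

    // Ask parquet-mr to apply every row-group filter level: stats, dictionary, bloom filter.
    def filterAllLevels(
        filter: FilterCompat.Filter,
        blocks: java.util.List[BlockMetaData],
        reader: ParquetFileReader): java.util.List[BlockMetaData] = {
      RowGroupFilter.filterRowGroups(
        Arrays.asList(
          RowGroupFilter.FilterLevel.STATISTICS,
          RowGroupFilter.FilterLevel.DICTIONARY,
          RowGroupFilter.FilterLevel.BLOOMFILTER),
        filter,
        blocks,
        reader)
    }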
  def getFileMeta: FileMeta
  override def hashCode(): Int = path.hashCode
  override def equals(other: Any): Boolean = other match {
    case df: FileMetaKey => path.equals(df.path)
What if the same file gets replaced? How do we invalidate the cache? This is very common in my experience, e.g., Hive overwriting a partition.

This is a very good question; we discussed it in #33748 (comment).
If the file name contains a timestamp, I think we don't have to worry too much: the names of the new file and the old file are different, so we won't read the wrong data.
If a file is manually replaced by one with the same name and the corresponding file meta exists in the cache, an incorrect file meta will be used to read the data. If the read fails, the job fails; but if the read happens to succeed, the job will read the wrong data.
In fact, even without `FileMetaCache`, there is a similar risk when manually replacing files with the same name, because the offset and length of a PartitionedFile may no longer match after the replacement for a running job.
At the same time, I added a warning about this feature in SQLConf.
The Parquet part is a draft now because of the deprecated API; we are focusing on ORC (SPARK-36516).
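One hypothetical mitigation for the replaced-file case (not part of this PR): fold the file's length and modification time into the cache key, so that a file overwritten in place under the same path misses the cache instead of returning stale metadata. A minimal Scala sketch, with FileMeta standing in for the PR's cache value type:

    import org.apache.hadoop.fs.Path

    trait FileMeta

    // Assumed variant of the key above: equality covers path, length, and mtime.
    abstract class FileMetaKey(
        val path: Path,
        val length: Long,
        val modificationTime: Long) {
      def getFileMeta: FileMeta

      override def hashCode(): Int =
        java.util.Objects.hash(path, Long.box(length), Long.box(modificationTime))

      override def equals(other: Any): Boolean = other match {
        case k: FileMetaKey =>
          path.equals(k.path) && length == k.length && modificationTime == k.modificationTime
        case _ => false
      }
    }

The trade-off is an extra file-status lookup when building the key, unless the listing already done for PartitionedFile can be reused.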
Refer to apache#30483 (cherry picked from commit cb7852c)
1. Implement LocalDataCacheManager
2. Base on xiaoxiang's PR
3. Implement CacheFileScanRDD
4. Implement AbstractCacheFileSystem
5. Optimize performance
6. Support soft affinity for HDFS
7. Support ByteBuffer reads and avoid reading data one byte at a time
8. Add File Metadata cache support for Parquet: refer to apache#30483
9. Support caching small files in memory: ByteBufferPageStore extends PageStore to cache data in memory
What changes were proposed in this pull request?
The main purpose of this PR is to introduce a File Meta Cache mechanism for Spark SQL; a basic File Meta Cache implementation for Parquet is provided at the same time.
The main changes of this PR are as follows:
Defined a FileMetaCacheManager to cache the mapping from FileMetaKey to FileMeta. The FileMetaKey is the cache key; its equals is determined by the file path by default. The FileMeta represents the cache value and is generated by the FileMetaKey#getFileMeta method.
Currently, FileMetaCacheManager supports a simple cache-expiration eviction mechanism, and the expiration time is determined by the new config FILE_META_CACHE_TTL_SINCE_LAST_ACCESS.
For the Parquet file format, this PR adds ParquetFileMetaKey and ParquetFileMeta to cache the Parquet file footer. The footer cache can be used by the vectorized read path in both DS API V1 and V2, and the feature is enabled when FILE_META_CACHE_PARQUET_ENABLED is true.
Currently, the file meta cache mechanism cannot be used by RowBasedReader; it needs the completion of PARQUET-1965 for further support.
Why are the changes needed?
Support the Parquet datasource using the File Meta Cache mechanism to reduce the number of metadata reads when multiple queries are performed on the same dataset.
Does this PR introduce any user-facing change?
Add new configs:
FILE_META_CACHE_PARQUET_ENABLED (spark.sql.fileMetaCache.parquet.enabled): indicates whether to enable the Parquet file meta cache mechanism.
FILE_META_CACHE_TTL_SINCE_LAST_ACCESS (spark.sql.fileMetaCache.ttlSinceLastAccess): the time-to-live for a file metadata cache entry after its last access, in seconds.
How was this patch tested?
ParquetQuerySuite
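To make the caching pattern described above more concrete, here is a minimal, hypothetical sketch built on Guava's LoadingCache with expire-after-access eviction. The class and method names are assumptions, and the PR's actual FileMetaCacheManager may be implemented differently:

    import java.util.concurrent.TimeUnit
    import com.google.common.cache.{CacheBuilder, CacheLoader, LoadingCache}

    object FileMetaCacheManagerSketch {
      // Stand-ins for the PR's key/value types.
      trait FileMeta
      abstract class FileMetaKey {
        def path: String
        def getFileMeta: FileMeta
        override def hashCode(): Int = path.hashCode
        override def equals(other: Any): Boolean = other match {
          case k: FileMetaKey => path == k.path
          case _ => false
        }
      }

      // Would be driven by spark.sql.fileMetaCache.ttlSinceLastAccess in the real code.
      private val ttlSinceLastAccessSec = 3600L

      private val cache: LoadingCache[FileMetaKey, FileMeta] =
        CacheBuilder.newBuilder()
          .expireAfterAccess(ttlSinceLastAccessSec, TimeUnit.SECONDS)
          .build(new CacheLoader[FileMetaKey, FileMeta] {
            // On a miss, the key materializes its own metadata (e.g. reads the Parquet footer).
            override def load(key: FileMetaKey): FileMeta = key.getFileMeta
          })

      def get(key: FileMetaKey): FileMeta = cache.get(key)
    }

With this shape, a reader path only calls get with a Parquet-specific key, and entries disappear on their own once they have not been touched for the configured TTL.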