Conversation

@LuciferYang
Contributor

@LuciferYang LuciferYang commented Nov 24, 2020

What changes were proposed in this pull request?

The main purpose of this PR is to introduce a File Meta Cache mechanism for Spark SQL, together with a basic File Meta Cache implementation for Parquet.

The main changes of this PR are as follows:

  • Defined a FileMetaCacheManager to cache the mapping from FileMetaKey to FileMeta. FileMetaKey is the cache key; its equality is determined by the file path by default. FileMeta represents the cache value and is produced by the FileMetaKey#getFileMeta method (a rough sketch is included below).

  • Currently, FileMetaCacheManager supports a simple cache expiration mechanism; the expiration time is determined by the new config FILE_META_CACHE_TTL_SINCE_LAST_ACCESS.

  • For the Parquet file format, this PR adds ParquetFileMetaKey and ParquetFileMeta to cache the Parquet file footer. The footer cache can be used by the vectorized read path in both DS API V1 and V2, and the feature is enabled when FILE_META_CACHE_PARQUET_ENABLED is true.

Currently, the file meta cache mechanism cannot be used by the row-based reader; further support depends on the completion of PARQUET-1965.
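
For reference, here is a minimal sketch of how these pieces might fit together. It is based only on the class names and signatures visible in this PR; the Guava-based cache, the deprecated readFooter call, and the hard-coded TTL are assumptions made for illustration, not the PR's actual implementation.

```scala
import java.util.concurrent.TimeUnit

import com.google.common.cache.{CacheBuilder, CacheLoader}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.format.converter.ParquetMetadataConverter.NO_FILTER
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.metadata.ParquetMetadata

// Cache value marker and key abstraction: key equality is based on the file path,
// and the key knows how to (re)build its value on a cache miss.
trait FileMeta

abstract class FileMetaKey {
  def path: Path
  def getFileMeta: FileMeta
  override def hashCode(): Int = path.hashCode
  override def equals(other: Any): Boolean = other match {
    case k: FileMetaKey => path.equals(k.path)
    case _ => false
  }
}

// Parquet flavor: the cached value is the file footer.
class ParquetFileMeta(val footer: ParquetMetadata) extends FileMeta

case class ParquetFileMetaKey(path: Path, configuration: Configuration) extends FileMetaKey {
  // One plausible way to build the value; the PR notes the non-deprecated
  // replacement depends on PARQUET-1965.
  override def getFileMeta: FileMeta =
    new ParquetFileMeta(ParquetFileReader.readFooter(configuration, path, NO_FILTER))
}

// Manager sketch: a loading cache whose entries expire some time after last access
// (the role of FILE_META_CACHE_TTL_SINCE_LAST_ACCESS); Guava and the fixed TTL are
// assumptions made for this sketch.
object FileMetaCacheManager {
  private val ttlSeconds = 3600L

  private val cache = CacheBuilder.newBuilder()
    .expireAfterAccess(ttlSeconds, TimeUnit.SECONDS)
    .build(new CacheLoader[FileMetaKey, FileMeta] {
      override def load(key: FileMetaKey): FileMeta = key.getFileMeta
    })

  def get(key: FileMetaKey): FileMeta = cache.get(key)
}
```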

Why are the changes needed?

Let the Parquet data source use the File Meta Cache mechanism to reduce the number of metadata reads when multiple queries are performed on the same dataset.

Does this PR introduce any user-facing change?

Adds the following new configs (a usage example follows the list):

  • FILE_META_CACHE_PARQUET_ENABLED (spark.sql.fileMetaCache.parquet.enabled) to indicate whether the Parquet file meta cache mechanism is enabled
  • FILE_META_CACHE_TTL_SINCE_LAST_ACCESS (spark.sql.fileMetaCache.ttlSinceLastAccess) to set the time-to-live for a file metadata cache entry after last access, in seconds
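
A minimal usage sketch, assuming both configs are accepted as session-level settings; the config keys come from this PR's description, everything else (app name, path, the value 3600) is illustrative:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]") // for a local test only
  .appName("file-meta-cache-demo")
  .config("spark.sql.fileMetaCache.parquet.enabled", "true")
  .config("spark.sql.fileMetaCache.ttlSinceLastAccess", "3600") // seconds
  .getOrCreate()

spark.read.parquet("/path/to/table").count() // footer is cached on first read
```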

How was this patch tested?

  • Pass the Jenkins or GitHub Actions builds
  • Add new tests to ParquetQuerySuite

@LuciferYang LuciferYang marked this pull request as draft November 24, 2020 11:47
@LuciferYang
Contributor Author

@wangyum WIP now; still missing some configuration entries, test suites, and ORC file support.

@SparkQA

SparkQA commented Nov 24, 2020

Test build #131655 has finished for PR 30483 at commit 8357771.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@github-actions github-actions bot added the SQL label Nov 24, 2020
Member

@dongjoon-hyun dongjoon-hyun left a comment

Since your fix is merged here,
could you fix the Scala style and rebase onto master, @LuciferYang?

@LuciferYang
Contributor Author

Since your fix is merged here,
could you fix the Scala style and rebase onto master, @LuciferYang?

OK~ will do it later ~

@SparkQA

SparkQA commented Nov 25, 2020

Test build #131754 has finished for PR 30483 at commit 8bba51a.

  • This patch fails Java style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class TruncateTable(

@wangyum
Member

wangyum commented Nov 25, 2020

@LuciferYang It would be great if we had some benchmark numbers.

@LuciferYang
Contributor Author

@wangyum this is a very good suggestion ~

@SparkQA

SparkQA commented Nov 25, 2020

Test build #131760 has finished for PR 30483 at commit 3e2db1a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 25, 2020

Test build #131771 has finished for PR 30483 at commit 92d2f37.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 25, 2020

Test build #131775 has finished for PR 30483 at commit 44ca052.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 16, 2021

Test build #142494 has finished for PR 30483 at commit aace310.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@LuciferYang
Contributor Author

LuciferYang commented Aug 16, 2021

@dongjoon-hyun Because #33748 is an ORC-only PR and uses a new JIRA, SPARK-36516, I'll change this PR to Parquet-only.

@LuciferYang LuciferYang changed the title [SPARK-33449][SQL] Add File Metadata cache support for Parquet and Orc [SPARK-33449][SQL] Add File Metadata cache support for Parquet Aug 16, 2021
@LuciferYang
Contributor Author

There will be some duplicate code in the two PRs; this part of the code will be synchronized after one of them is merged.

@LuciferYang LuciferYang changed the title [SPARK-33449][SQL] Add File Metadata cache support for Parquet [SPARK-33449][SQL] Add File Metadata cache support for Parquet or ORC Aug 16, 2021
@LuciferYang LuciferYang changed the title [SPARK-33449][SQL] Add File Metadata cache support for Parquet or ORC [SPARK-33449][SQL] Add File Metadata cache support for Parquet Aug 16, 2021
@SparkQA

SparkQA commented Aug 16, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46992/

@SparkQA

SparkQA commented Aug 16, 2021

Test build #142498 has finished for PR 30483 at commit fa75a95.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Got it. Thank you for moving this effort forward, @LuciferYang.

@SparkQA

SparkQA commented Aug 16, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46992/

@SparkQA

SparkQA commented Aug 16, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47001/

@SparkQA

SparkQA commented Aug 16, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46999/

@SparkQA

SparkQA commented Aug 16, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46999/

@SparkQA

SparkQA commented Aug 16, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47006/

@SparkQA

SparkQA commented Aug 16, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47006/

@SparkQA

SparkQA commented Aug 16, 2021

Test build #142500 has finished for PR 30483 at commit 4c022d7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 16, 2021

Test build #142505 has finished for PR 30483 at commit 104b125.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • abstract class FileMetaKey
  • case class ParquetFileMetaKey(path: Path, configuration: Configuration)
  • class ParquetFileMeta(val footer: ParquetMetadata) extends FileMeta


-  lazy val footerFileMetaData =
+  lazy val footerFileMetaData = if (parquetMetaCacheEnabled) {
+    ParquetFileMeta.readFooterFromCache(filePath, conf).getFileMetaData
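
For context, a rough sketch of how the cached read might be wired in; the fallback branch, and any name other than ParquetFileMeta.readFooterFromCache, is an assumption rather than the PR's actual code (filePath, conf and parquetMetaCacheEnabled are assumed to be in scope in the reader):

```scala
import org.apache.parquet.format.converter.ParquetMetadataConverter.SKIP_ROW_GROUPS
import org.apache.parquet.hadoop.ParquetFileReader

// Sketch only: read the footer through the cache when the feature is on,
// otherwise fall back to the pre-existing (deprecated) parquet-mr footer read.
lazy val footerFileMetaData =
  if (parquetMetaCacheEnabled) {
    ParquetFileMeta.readFooterFromCache(filePath, conf).getFileMetaData
  } else {
    ParquetFileReader.readFooter(conf, filePath, SKIP_ROW_GROUPS).getFileMetaData
  }
```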

Member

What happens if the file is removed and replaced?

Contributor Author

We can discuss it in #33748 first; I'll set this PR to draft for now.

@LuciferYang LuciferYang marked this pull request as draft August 17, 2021 02:51
zzcclp added a commit to zzcclp/spark that referenced this pull request Aug 17, 2021
@dongjoon-hyun dongjoon-hyun changed the title [SPARK-33449][SQL] Add File Metadata cache support for Parquet [SPARK-33449][SQL] Support File Metadata Cache for Parquet Aug 17, 2021
Member

@sunchao sunchao left a comment

Thanks @LuciferYang for the PR. Correct me if I'm wrong but I feel this is only very useful in a cluster where Spark executors are reused across different queries, and even in that case we'll need to be very careful on cache invalidation, since the same file can be overwritten with different content (e.g., in Hive insert overwrite).

I noticed that Spark currently needs to read the footer twice: once in SpecificParquetRecordReaderBase and another in ParquetFileFormat or ParquetPartitionReaderFactory. This can be fixed separately with a much simpler approach.

.createWithDefault(false)

val FILE_META_CACHE_PARQUET_ENABLED = buildConf("spark.sql.fileMetaCache.parquet.enabled")
  .doc("To indicate if enable parquet file meta cache, it is recommended to enabled " +

Member

Hmm, curious whether this can help if your Spark queries are running as separate Spark jobs, where each of them may use different executors.

Contributor Author

@LuciferYang LuciferYang Aug 19, 2021

Yes, this feature does have limitations. NODE_LOCAL + thrift-server with interactive analysis should be the best scenario. If the architecture separates storage and compute, we need to consider task scheduling.

In fact, in the OAP project, the fileMetaCache relies on the dataCache (PROCESS_LOCAL).

  .booleanConf
  .createWithDefault(false)

val FILE_META_CACHE_TTL_SINCE_LAST_ACCESS =

Member

nit: maybe FILE_META_CACHE_TTL_SINCE_LAST_ACCESS_SEC and spark.sql.fileMetaCache.ttlSinceLastAccessSec, so it's easier to know that the unit is seconds?

Contributor Author

good suggestion
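
If the rename is adopted, the SQLConf entry might look roughly like this: a sketch meant to sit inside SQLConf next to the existing entries (where buildConf is available); the doc text and default are placeholders, and timeConf is just one option:

```scala
import java.util.concurrent.TimeUnit

val FILE_META_CACHE_TTL_SINCE_LAST_ACCESS_SEC =
  buildConf("spark.sql.fileMetaCache.ttlSinceLastAccessSec")
    .doc("Time-to-live in seconds for a file metadata cache entry after last access.")
    .timeConf(TimeUnit.SECONDS)
    .createWithDefault(3600L)
```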

this.fileSchema = footer.getFileMetaData().getSchema();
FilterCompat.Filter filter = ParquetInputFormat.getFilter(configuration);
List<BlockMetaData> blocks =
    RowGroupFilter.filterRowGroups(filter, footer.getBlocks(), fileSchema);

Member

Does this apply all the filter levels, e.g., stats, dictionary, and bloom filter?

Contributor Author

I need to investigate it again
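
For readers following the thread: if I read parquet-mr correctly, the schema-based overload used in the snippet above applies only statistics-level pruning, while dictionary and bloom-filter pruning go through the overload that takes explicit filter levels and a ParquetFileReader. A hedged sketch (API names from memory, to be verified against the parquet-mr version in use):

```scala
import java.util.{Arrays, List => JList}

import org.apache.parquet.filter2.compat.{FilterCompat, RowGroupFilter}
import org.apache.parquet.filter2.compat.RowGroupFilter.FilterLevel
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.metadata.BlockMetaData

// Sketch: request statistics, dictionary and bloom-filter pruning explicitly.
def filterAllLevels(
    filter: FilterCompat.Filter,
    blocks: JList[BlockMetaData],
    reader: ParquetFileReader): JList[BlockMetaData] = {
  RowGroupFilter.filterRowGroups(
    Arrays.asList(FilterLevel.STATISTICS, FilterLevel.DICTIONARY, FilterLevel.BLOOMFILTER),
    filter,
    blocks,
    reader)
}
```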

def getFileMeta: FileMeta
override def hashCode(): Int = path.hashCode
override def equals(other: Any): Boolean = other match {
  case df: FileMetaKey => path.equals(df.path)

Member

What if the same file gets replaced? How do we invalidate the cache? This is very common in my experience, e.g., Hive overwriting a partition.

Contributor Author

This is a very good question; we discussed it in #33748 (comment).

If the file name contains a timestamp, I think we don't have to worry too much: the names of the new file and the old file are different, which ensures we don't read the wrong data.

If a file is manually replaced by one with the same name and the corresponding file meta exists in the cache, the stale file meta will be used to read the data. If the read fails, the job fails; but if the read happens to succeed, the job will read wrong data.

In fact, even without `FileMetaCache` there is a similar risk when files are manually replaced with the same name, because the offset and length of a PartitionedFile may no longer match after the replacement for a running job.

At the same time, I added a warning about this feature in SQLConf.

The Parquet PR is currently a draft because of the deprecated API; we are focusing on ORC (SPARK-36516) now.
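
One possible mitigation, purely illustrative and not part of this PR: fold the file's length and modification time into the cache key, so a file that is silently replaced under the same name produces a cache miss instead of returning a stale footer, at the cost of an extra getFileStatus call per lookup.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Hypothetical variant of the key: identity covers path, size and mtime, but not the
// Configuration (which has no useful equals); an overwritten file with the same name
// then hashes to a different key and simply misses the cache.
case class VersionedFileMetaKey(
    path: Path,
    length: Long,
    modificationTime: Long,
    configuration: Configuration) {
  override def hashCode(): Int = (path, length, modificationTime).hashCode
  override def equals(other: Any): Boolean = other match {
    case k: VersionedFileMetaKey =>
      path.equals(k.path) && length == k.length && modificationTime == k.modificationTime
    case _ => false
  }
}

object VersionedFileMetaKey {
  // Builds the key from the current file status; the extra namenode round trip is
  // exactly the trade-off discussed above.
  def apply(path: Path, conf: Configuration): VersionedFileMetaKey = {
    val status = path.getFileSystem(conf).getFileStatus(path)
    VersionedFileMetaKey(path, status.getLen, status.getModificationTime, conf)
  }
}
```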

zzcclp added a commit to Kyligence/spark that referenced this pull request Sep 4, 2021
zzcclp added a commit to zzcclp/spark that referenced this pull request Sep 20, 2021
1. Implement LocalDataCacheManager
2. base xiaoxiang's PR
3. Implement CacheFileScanRDD
4. Implement AbstractCacheFileSystem
5. Optimize performance
6. Support soft affinity for hdfs
7. Support ByteBuffer to read data, and avoid to read data one byte by one byte
8. Add File Metadata cache support for Parquet : Refer to apache#30483
9. Support to cache small files in memory : ByteBufferPageStore extends PageStore to support cache data in memory
@LuciferYang LuciferYang deleted the SPARK-33449 branch October 22, 2023 07:34