[SPARK-36516][SQL] Support File Metadata Cache for ORC #33748
Conversation
Ya, it's a little blocked by the ongoing discussion because it's related. Thank you for waiting for it, @LuciferYang.

Since the metadata is cached in the executor, does it mean a task reading the same ORC file has to be scheduled on the same executor? How can we guarantee this?

@sunchao please correct me if I'm wrong: if we overwrite the table, the same file names will be reused, which could potentially cause an inconsistency issue. Shouldn't we have some safeguard, such as checking the file sizes? Thanks,
Even when overwriting the table, will we use the same file names? I remember the file names include a unique task id/attempt id.
In Hive it's common that the same file name (e.g., 000000_0) gets reused when doing insert overwrite. Even if we check file size and other attributes, that can't completely prevent us from hitting a stale cache.
At present, I think it is best-effort because there is no guarantee of scheduling; if there are many ..., we may need to collect information about ...
Can we add the file modification time? Similarly, how do we ensure that the ...
Yea, file path + modification time seems like a good way to validate the cache.
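A minimal sketch of such a validating key, using Hadoop's `FileStatus` to carry the modification time; the name and shape are illustrative and not part of this PR:

```scala
// Illustrative only: a cache key that includes the modification time, so a file that is
// overwritten in place (same name, new contents) no longer matches the stale entry.
import org.apache.hadoop.fs.{FileStatus, Path}

case class ValidatingFileMetaKey(path: Path, modificationTime: Long)

object ValidatingFileMetaKey {
  // Build the key from the FileStatus Spark already has when planning the scan.
  def apply(status: FileStatus): ValidatingFileMetaKey =
    ValidatingFileMetaKey(status.getPath, status.getModificationTime)
}
```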
I think they are not quite the same: the FileStatus cache operates on a per-table basis, so you'll only get stale data in the worst case. However, here the cache is on a per-file basis, so one could end up with some files in a partition that are cached while the rest are not. In addition, at least in Parquet, metadata, including row count, stats, bloom filters, etc., are used to filter row groups or data pages. Task failures or correctness issues could happen if we apply stale metadata to a newer file, or if the metadata is used in aggregation pushdown (ongoing work).
I understand you want to avoid the duplicate footer lookup. In Parquet at least we can just pass the footer from either ...
If we can add some strategies to Spark in the future to ensure that ...
Yea, but it adds complexity and more memory consumption like you mentioned earlier, and you'd need the driver to be a long-running process like a Presto coordinator; I'm not sure how many people use Spark this way.
There should be many. We can do some survey, haha ~ |
Yes, in our production environment we did change a lot of code for similar optimizations.
@dongjoon-hyun Could you elaborate on what the twice footer lookups are here in a single task? If it's within a single task, then the files should be the same, so the life of the cache can be just for a single task, right? I thought the purpose of this PR is that when the same ORC file is used by multiple tasks, the fileMetaCache can be used to avoid reading the footer multiple times, with the caveat that those tasks have to be scheduled on the same executor. As @LuciferYang and @sunchao mentioned above, this requires adding something like a Presto coordinator to ensure the footer cache can be reused. I feel it's fairly complicated, and I don't know if it's worth it. For this use case, we might just use Iceberg, which stores the metadata in a separate manifest.
@dongjoon-hyun db90daf and 7327fdb change to use ...
What changes were proposed in this pull request?
The main purpose of this PR is to introduce a File Meta Cache mechanism for Spark SQL, together with a basic File Meta Cache implementation for ORC. There was originally a PR that supported the file meta cache for both Parquet and ORC, but Parquet has no non-deprecated API that can be used to pass a footer when creating a new `ParquetFileReader`, and both the Apache Spark and Parquet communities are reluctant to promote the deprecated API, so this PR spins off the ORC-only part.

The main changes of this PR are as follows:
- Defined a `FileMetaCacheManager` to cache the mapping from `FileMetaKey` to `FileMeta`. The `FileMetaKey` is the cache key; its `equals` is determined by the file path by default. The `FileMeta` represents the cache value and is generated by the `FileMetaKey#getFileMeta` method. A minimal sketch of such a manager appears below.
- Currently, the `FileMetaCacheManager` supports a simple cache expiration mechanism: the expiration time is determined by the new config `FILE_META_CACHE_TTL_SINCE_LAST_ACCESS`, and the maximum number of file meta entries the cache holds on each executor is determined by the new config `FILE_META_CACHE_MAXIMUM_SIZE`.
- For the ORC file format, this PR adds `OrcFileMetaKey` and `OrcFileMeta` to cache the ORC file Tail. The Tail cache can be used by the vectorized read path in both DS API V1 and V2; the feature is enabled when `FILE_META_CACHE_ENABLED_SOURCE_LIST` is configured with `orc`.
- Currently, the file meta cache mechanism cannot be used by `RowBasedReader`; it needs the completion of ORC-746 for further support.

The fileMetaCache requires users to pay special attention to the following situation: if the fileMetaCache is enabled, existing data files should not be replaced under the same file name; otherwise there is a risk of job failure or wrong data being read before the cache entry expires.
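To make the shape of this concrete, here is a minimal sketch of such a cache manager backed by a Guava `LoadingCache` (which Spark already uses for other caches). The names `FileMetaCacheManager`, `FileMetaKey`, and `FileMeta` follow the PR description; the default TTL/size values, field choices, and method signatures below are illustrative assumptions, not the PR's actual code.

```scala
// A minimal sketch, not the PR's implementation: an executor-side cache keyed by
// FileMetaKey, with TTL-after-access and a maximum size, backed by a Guava LoadingCache.
import java.util.concurrent.TimeUnit

import com.google.common.cache.{CacheBuilder, CacheLoader, LoadingCache}

trait FileMeta

abstract class FileMetaKey {
  def path: String
  // Format-specific subclasses (e.g., an ORC key) load the actual metadata, such as the ORC file tail.
  def getFileMeta: FileMeta
  // Cache-key equality is determined by the file path by default.
  override def hashCode(): Int = path.hashCode
  override def equals(other: Any): Boolean = other match {
    case k: FileMetaKey => path == k.path
    case _ => false
  }
}

object FileMetaCacheManager {
  // Illustrative defaults standing in for FILE_META_CACHE_TTL_SINCE_LAST_ACCESS
  // and FILE_META_CACHE_MAXIMUM_SIZE.
  private val ttlSinceLastAccessSec = 3600L
  private val maximumSize = 1000L

  private val cache: LoadingCache[FileMetaKey, FileMeta] = CacheBuilder.newBuilder()
    .maximumSize(maximumSize)
    .expireAfterAccess(ttlSinceLastAccessSec, TimeUnit.SECONDS)
    .build(new CacheLoader[FileMetaKey, FileMeta] {
      override def load(key: FileMetaKey): FileMeta = key.getFileMeta
    })

  // Returns the cached metadata for the key, loading it on a cache miss.
  def get(key: FileMetaKey): FileMeta = cache.get(key)
}
```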
Why are the changes needed?
Support the ORC data source using the File Meta Cache mechanism to reduce the number of metadata reads when multiple queries are performed on the same dataset.
Does this PR introduce any user-facing change?
Add 3 new configs:

- `FILE_META_CACHE_ENABLED_SOURCE_LIST` (`spark.sql.fileMetaCache.enabledSourceList`): a comma-separated list of data source short names for which the file meta cache is enabled. Currently the file meta cache only supports ORC. It is recommended to enable this config when multiple queries are performed on the same dataset. The cache is disabled by default.
- `FILE_META_CACHE_TTL_SINCE_LAST_ACCESS` (`spark.sql.fileMetaCache.ttlSinceLastAccess`): time-to-live for a file metadata cache entry after the last access, in seconds.
- `FILE_META_CACHE_MAXIMUM_SIZE` (`spark.sql.fileMetaCache.maximumSize`): the maximum number of file meta entries the file meta cache contains.
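For illustration, enabling the cache for ORC with these configs could look like the following (the TTL and size values are arbitrary examples, and the dataset path is hypothetical):

```scala
// Hypothetical usage of the configs introduced by this PR; values are arbitrary examples.
spark.conf.set("spark.sql.fileMetaCache.enabledSourceList", "orc")
spark.conf.set("spark.sql.fileMetaCache.ttlSinceLastAccess", "3600") // seconds
spark.conf.set("spark.sql.fileMetaCache.maximumSize", "1000")

// Repeated scans over the same ORC dataset can then reuse cached file tails instead of
// re-reading each file's footer, provided the tasks land on the same executor.
spark.read.orc("/path/to/dataset").filter("id > 0").count()
spark.read.orc("/path/to/dataset").groupBy("id").count()
```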
How was this patch tested?

- `FileMetaCacheSuite` and `SQLConfSuite`
- `FileMetaCacheReadBenchmark` to measure the benefits