[Feature] Support "delta format sharing" and release delta-sharing-spark 3.1 #2291
Status: Closed · Labels: enhancement (New feature or request)
vkorukanti pushed a commit that referenced this issue on Jan 9, 2024:

> Adds snapshot support for "delta format sharing"; this is the second PR of issue #2291.
> - DeltaSharingDataSource with snapshot query support
> - DeltaSharingDataSourceDeltaSuite
> - DeltaSharingDataSourceDeltaTestUtils / TestClientForDeltaFormatSharing / TestDeltaSharingFileSystem
>
> Closes #2440
> GitOrigin-RevId: a095445b6da809ee9a5b4ece7c38d04a172ff70f
vkorukanti pushed a commit that referenced this issue on Jan 12, 2024:

> …n 3.1
>
> ## Description
> (Cherry-pick of #2472 to branch-3.1) Fourth PR of #2291: adds streaming support for "delta format sharing", and adds a column mapping test.
> - DeltaSharingDataSource with streaming query support
> - DeltaFormatSharingSource
> - DeltaFormatSharingSourceSuite / DeltaSharingDataSourceCMSuite
>
> ## How was this patch tested?
> Unit tests.
vkorukanti pushed a commit that referenced this issue on Jan 13, 2024:

> …ring
>
> ## Description
> (Cherry-pick of #2480 to branch-3.1) Fifth PR of #2291: adds deletion vector support for "delta format sharing".
> - Extends PrepareDeltaScan to PrepareDeltaSharingScan, to convert DeltaSharingFileIndex to TahoeLogFileIndex.
> - Updates DeltaSparkSessionExtension to add the PrepareDeltaSharingScan rule.
> - Added unit test in DeltaSharingDataSourceDeltaSuite.
>
> ## How was this patch tested?
> Unit tests.
The github-project-automation bot moved this from Todo to Done in the Linux Foundation Delta Lake Roadmap on Jan 30, 2024.
Feature request
Support "delta format sharing" and release delta-sharing-spark 3.1
Which Delta project/connector is this regarding?
delta-sharing-spark
Context: Advanced Delta Features
Advanced Delta features such as Deletion Vectors and Column Mapping have been developed, and with them Delta is no longer a parquet-only protocol. In order to catch up with these advanced features, we propose upgrading the delta sharing protocol to support "delta format sharing", in which the shared table is returned in delta format and the existing delta spark library is leveraged to read the data. The benefit is avoiding code duplication when supporting newly created advanced delta features in delta-sharing-spark, and making it easier to catch up on future delta features.
Please refer to delta-io/delta-sharing#341 for the original proposal.
Delta Format Sharing
The idea is to transfer the delta log from the provider to the recipient via delta sharing http requests, construct a local delta log, and leverage the delta spark library to read the data out of it.
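As a minimal sketch of the "construct a local delta log" idea (the helper below is hypothetical, not the actual implementation, which works in Spark/Scala and uses an in-memory file system rather than local disk), the received action lines could be written into a `_delta_log` directory using Delta's zero-padded commit file naming, after which a delta reader can be pointed at the table root:

```python
import os
import tempfile

def write_local_delta_log(table_root, version, action_lines):
    """Write received delta action lines as a commit file in _delta_log.

    Delta log commit files are named with the 20-digit zero-padded
    version, e.g. 00000000000000000000.json for version 0.
    """
    log_dir = os.path.join(table_root, "_delta_log")
    os.makedirs(log_dir, exist_ok=True)
    commit_file = os.path.join(log_dir, f"{version:020d}.json")
    with open(commit_file, "w") as f:
        # One json action per line, as in a real delta commit file.
        f.write("\n".join(action_lines) + "\n")
    return commit_file

# Hypothetical usage with two action lines received from the server:
root = tempfile.mkdtemp()
path = write_local_delta_log(
    root, 0,
    ['{"metaData": {"id": "t"}}', '{"add": {"path": "https://presigned"}}'],
)
```

A reader that understands the delta log format can then treat `root` as an ordinary delta table directory.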
Protocol Changes
In the delta sharing protocol, a new http request header `delta-sharing-capabilities` will be introduced. Its value is a comma-separated list of capabilities, each of the form `capability_key=capability_value`. Example: `delta-sharing-capabilities: responseFormat=delta,readerfeatures=deletionVectors,columnMapping`.

An upgraded delta sharing server that understands the new header will parse it and prepare the response accordingly, ignoring any capability it cannot handle or whose value it does not recognize. However, it will return an error if the shared table requires a capability that is not specified in the header (the header indicates what the client supports).
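As a sketch of the parsing step (not the reference implementation), a server could split the header value into capabilities like this; a segment without an `=` is treated as an additional value of the preceding capability, which is how the `readerfeatures` example above carries multiple values:

```python
def parse_capabilities(header_value):
    """Parse a delta-sharing-capabilities header value into a dict.

    Capabilities are comma separated; each is capability_key=capability_value.
    A bare segment without '=' is treated as another value of the previous
    key (e.g. readerfeatures=deletionVectors,columnMapping).
    """
    capabilities = {}
    last_key = None
    for segment in header_value.split(","):
        segment = segment.strip()
        if not segment:
            continue
        if "=" in segment:
            key, value = segment.split("=", 1)
            capabilities.setdefault(key, []).append(value)
            last_key = key
        elif last_key is not None:
            # Continuation of the previous capability's value list.
            capabilities[last_key].append(segment)
    return capabilities

caps = parse_capabilities(
    "responseFormat=delta,readerfeatures=deletionVectors,columnMapping")
# caps == {"responseFormat": ["delta"],
#          "readerfeatures": ["deletionVectors", "columnMapping"]}
```

A server would then ignore unknown keys and error out only when the shared table needs a capability missing from the parsed map.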
If `responseFormat=delta` appears in the request header and the delta sharing server can handle it, the server adds a similar header to the response to indicate that it is handled: `delta-sharing-capabilities: responseFormat=delta`. Each line in the response body is then a json object that can be parsed as a delta action and used to construct a delta log on the client side. The only change is that the `path` field is a pre-signed url, so the client side needs to read the data out of the pre-signed url.

The goal is to support delta format sharing in the delta-io/delta oss repo and release delta-sharing-spark 3.1.
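To illustrate the client side of that response (a hedged sketch: the exact json wrapping used by the server is an assumption here, not taken from the spec), the body can be split into json lines, each treated as a delta action, collecting the pre-signed urls from the `add` actions:

```python
import json

def split_actions(response_body):
    """Parse an NDJSON response body into delta-style actions.

    Assumes each line is a json object keyed by action type, e.g.
    {"add": {"path": "https://...presigned...", ...}} -- the exact
    wrapping is an assumption of this sketch.
    """
    actions = []
    presigned_urls = []
    for line in response_body.splitlines():
        if not line.strip():
            continue
        action = json.loads(line)
        actions.append(action)
        add = action.get("add")
        if add and "path" in add:
            # With delta format sharing, path is a pre-signed url rather
            # than a relative path inside the table directory.
            presigned_urls.append(add["path"])
    return actions, presigned_urls
```

The client then writes the actions into its local delta log and reads the file contents from the collected pre-signed urls.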
Library Changes
In order to support this, we need to restructure the delta sharing libraries. We'll launch a delta-sharing-client library with two core pieces of functionality: the delta sharing client and related utils that handle http requests/responses to the delta sharing server, and the delta sharing file system and related utils that handle reading data out of pre-signed urls and refreshing those urls. With `responseFormat=delta`, the delta sharing client won't parse the json lines; it will let the delta spark library parse and handle them.

We'll continue to release the delta-sharing-spark library with the rest of the functionality, including the data source, the streaming source, options, etc. All of this code will be moved from delta-io/delta-sharing to delta-io/delta so it can leverage the delta classes and libraries to construct a delta log, read data, and finally serve the DataFrame to the query.
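One core responsibility of the proposed delta-sharing-client, refreshing pre-signed urls as they approach expiry, might look roughly like this sketch (the class name, callback, and margin parameter are hypothetical, not the released API):

```python
import time

class PreSignedUrlCache:
    """Maps stable file ids to pre-signed urls, refreshing near expiry.

    refresh_fn is a callback (e.g. an rpc back to the delta sharing
    server) returning a fresh {file_id: (url, expiration_ts)} map.
    """

    def __init__(self, refresh_fn, refresh_margin_secs=60):
        self._refresh_fn = refresh_fn
        self._margin = refresh_margin_secs
        self._urls = {}  # file_id -> (url, expiration_ts)

    def get(self, file_id, now=None):
        now = time.time() if now is None else now
        entry = self._urls.get(file_id)
        if entry is None or entry[1] - now < self._margin:
            # Unknown, expired, or close to expiry: fetch fresh urls.
            self._urls.update(self._refresh_fn())
            entry = self._urls[file_id]
        return entry[0]
```

The file system layer would consult such a cache each time it opens a shared file, so long-running queries keep working past the original urls' lifetime.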
High Level Implementation Details
Snapshot Queries
For snapshot queries, since we need to handle filter push-down to the delta sharing server:

- `createRelation` requires a `BaseRelation` to be returned, so we will return a basic `HadoopFsRelation`, for which a `FileIndex` is required.
- The table metadata is fetched with `client.getMetadata`.
- A `DeltaSharingFileIndex` class is built with the fetched metadata and used in a `HadoopFsRelation`, which is returned from `createRelation`.
- In `DeltaSharingFileIndex.listFiles`, both `partitionFilters` and `dataFilters` are available; they are converted to `jsonPredicateHints` in the rpc query to the delta sharing server.
- The server response is stored in the `BlockManager`, including the `Protocol`/`Metadata`/`AddFile`s.
- `DeltaSharingLogFileSystem`: a file system with the scheme `delta-sharing-log://`. The local `DeltaLog` will have a path starting with `delta-sharing-log://`, and `DeltaSharingLogFileSystem` will translate file paths to the blockIds that hold the content in the `BlockManager`, and serve the content.
- A `DeltaLog` class is constructed with a delta-sharing-log path pointing to the block manager.
- A `TahoeLogFileIndex` is constructed with the `DeltaLog`, and the result from this file index's `listFiles` will be returned.

CDF Queries
For cdf queries, since we are not applying filter push-down, we'll directly fetch delta files from the delta sharing server, construct the delta log, and leverage delta spark to read the cdf out of it.

- When `createRelation` is called with `readChangeFeed=true`, we'll start to prepare a `DeltaCDFRelation`.
- The `DeltaSharingClient` will send an rpc with the cdf options to fetch the needed files in delta format.
- We prepare a fake checkpoint at minVersion-1 to avoid going back to version 0.
- `Protocol` and `Metadata` are put in minVersion.json, as required by the `CDCReader`.
- `FileAction`s are put in the corresponding version.json, since a version is returned for each of them from the server.
- The constructed delta log is stored in the `BlockManager` and served through `DeltaSharingLogFileSystem`.
- `DeltaTableV2.toBaseRelation`, with a delta-sharing-log path pointing to the locally constructed table, is used to return the `BaseRelation` needed by `createRelation`.

Streaming Queries
For streaming queries, we will create a new class `DeltaFormatSharingSource`, which wraps an instance of `DeltaSource`. For its two main APIs, `latestOffset` and `getBatch`, it will first perform the delta sharing related tasks, such as using the `DeltaSharingClient` to request files from the server side and storing them in the local `BlockManager`, and then leverage `DeltaSource` to perform the concrete operations. Specifically:

- In `DeltaSharingDataSource`, when receiving a streaming query to create a Source: if it asks for delta format sharing, we return the `DeltaFormatSharingSource`; otherwise, the `DeltaSharingSource` for parquet format sharing.
- In `DeltaFormatSharingSource`, we wrap an instance of `DeltaSource`, i.e., `deltaSource`, which is built on top of the path of the local delta log constructed for the `DeltaSource` to use.
- When `latestOffset` is called, we first check whether there are unprocessed data files in the recipient's local delta log. If all local data is processed, we use `getTableVersion` to check whether there is new data on the provider side; if so, we fetch the new data with `queryTable` and store it locally (in the meantime, we also prepare the file-id-to-url mapping, set up url refresh, etc.). Then `deltaSource.latestOffset` is called and its return value is returned.
- When `getBatch` is called, we directly call `deltaSource.getBatch` with the same parameters to return the requested data.
- Data stored in the `BlockManager` before version-1 is cleaned up.

Project Plan
Willingness to contribute
The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?