-
Notifications
You must be signed in to change notification settings - Fork 362
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Delta Exporter: Azure Support #7444
Conversation
♻️ PR Preview 1032728 has been successfully destroyed since this PR has been closed. 🤖 By surge-preview |
ec576ce
to
d2fdb99
Compare
d2fdb99
to
f85324a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Always great to see test coverage added, thank you! I enjoyed reading this, few questions and requests for clarifications as I'm not very familiar with our exports.
// ListObjects Should be implemented when needed. There are nuances between HNS and BlobStorage which requires understanding the | ||
// Actual use case before implementing the solution | ||
func (c *Client) ListObjects(l *lua.State) int { | ||
lua.Errorf(l, "Not implemented") | ||
panic("unreachable") | ||
} | ||
|
||
// DeleteRecursive Should be implemented when needed. There are nuances between HNS and BlobStorage which requires understanding the | ||
// Actual use case before implementing the solution | ||
func (c *Client) DeleteRecursive(l *lua.State) int { | ||
lua.Errorf(l, "Not implemented") | ||
panic("unreachable") | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure how it works, but if this client implements an interface, wouldn't it be failing without the user understand why it was not implemented?
The user wants to export tables, can it work without these 2? Don't we delete objects once we delete a branch for example?
Script: script, | ||
Args: args, | ||
collector: collector, | ||
serverAddress: serverAddress, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lakeFS server address - used to create the lakeFS client in the lua script
@@ -412,18 +412,18 @@ Parameters: | |||
|
|||
A package used to export Delta Lake tables from lakeFS to an external cloud storage. | |||
|
|||
### `lakefs/catalogexport/delta_exporter.export_delta_log(action, table_names, writer, delta_client, table_descriptors_path)` | |||
### `lakefs/catalogexport/delta_exporter.export_delta_log(action, table_def_names, write_object, delta_client, table_descriptors_path)` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a breaking change? I'm not very familiar with named params in lua
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was a bug in the documentation - the code did not change
local aws = require("aws") | ||
local formats = require("formats") | ||
local delta_exporter = require("lakefs/catalogexport/delta_exporter") | ||
|
||
local table_descriptors_path = "_lakefs_tables" | ||
local sc = aws.s3_client(args.aws.access_key_id, args.aws.secret_access_key, args.aws.region) | ||
local delta_client = formats.delta_client(args.lakefs.access_key_id, args.lakefs.secret_access_key, args.aws.region) | ||
local delta_table_locations = delta_exporter.export_delta_log(action, args.table_names, sc.put_object, delta_client, table_descriptors_path) | ||
|
||
for t, loc in pairs(delta_table_locations) do | ||
print("Delta Lake exported table \"" .. t .. "\"'s location: " .. loc .. "\n") | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we copying the script here and not utilizing the higher level lua func?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I understand the comment - can you explain?
local table_descriptors_path = "_lakefs_tables" | ||
local sc = azure.client(args.azure.storage_account, args.azure.access_key) | ||
local delta_client = formats.delta_client(args.lakefs.access_key_id, args.lakefs.secret_access_key) | ||
local delta_table_locations = delta_exporter.export_delta_log(action, args.table_names, sc.put_object, delta_client, table_descriptors_path) | ||
|
||
for t, loc in pairs(delta_table_locations) do | ||
print("Delta Lake exported table \"" .. t .. "\"'s location: " .. loc .. "\n") | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same question applies here
end | ||
args: | ||
azure: | ||
storage_account: "{{ .AzureStorageAccount }}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a secret? Consider just putting this as plaintext here if it isn't
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It makes it easier to change the storage account name (in a single place) in the code and not in the yaml file
esti/catalog_export_test.go
Outdated
}) | ||
|
||
runs := waitForListRepositoryRunsLen(ctx, t, repo, headCommit.Id, 1) | ||
require.Equal(t, "completed", runs.Results[0].Status) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider checking the run result name or other identifier to make sure you're seeing the correct result
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey, that's a big PR so it'll take some time but I added the most meaningful things that I found so far.
|
||
// DeleteRecursive Should be implemented when needed. There are nuances between HNS and BlobStorage which requires understanding the | ||
// Actual use case before implementing the solution | ||
func (c *Client) DeleteRecursive(l *lua.State) int { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's remove this function, why keep it if it's not implemented? I find it very confusing including docs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For conformity we created a client interface, the client objects need to implement that interface to support all the required functionalities.
This is relevant beyond the scope of the delta exporter. I think it is essential to keep the code modular.
In the future when we decide to add support for Hive export in Azure - we will need to implement it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is AWS Client for S3 and Glue, why did you rename s3.go
to client.go
? Also keeping the glue.go
Let's just keep the files separated with clear names like e.g s3.go
.
client *service.Client | ||
} | ||
|
||
func newClient(ctx context.Context) lua.Function { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you be explicit about that client since it's not general Azure client instead storage client?
We're investing in our lua SDK's not only for export hooks, the azure SDK should be consistent with AWS where we have glue, s3 etc.
Now it's easy to change, later on when we want to add some outside of azure storage functionality this will get confusing very fast.
pkg/actions/lua/storage/client.go
Outdated
DeleteRecursive(l *lua.State) int | ||
} | ||
|
||
func InitStorageClient(l *lua.State, client Client) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO we don't need this abstraction, I fear it's too early and make things more complex.
To make it easier to use with exporter like export_delta_log
you can use the function param write_object and pass a wrapper function for azure storage client write_object: function(bucket, key, data)
it doesn't need to be on the SDK level.
- GCS is not included here
- In S3/Azure the interface has 2 missing functions (delete_recursive, list_objects missing)
PutObject(l *lua.State) int
is not not matching in both cases (referring to implementation with comment// Skipping first argument as it is the host arg (bucket which is irrelevant in Azure)
in azure that you added.).- Those SDK's are not meant to be used by export hooks only, our users should be able to use those SDK independently for their own needs, the combination of them here doesn't seem friendly to user experience.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for addressing previous comment!
Added some comments, overall looks like a very useful addition :)
Co-authored-by: isan_rivkin <isanrivkin@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Can't wait to use it!
Closes #7296
Change Description
Background
Add support for delta catalog export with Azure
Testing Details
Added new esti test for delta catalog export with AWS and Azure falvors
Breaking Change?
No