Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++][FS][Azure] Implement Move() for flat namespace storage accounts #40025

Open
Tom-Newton opened this issue Feb 10, 2024 · 7 comments
Open

Comments

@Tom-Newton
Copy link
Contributor

Describe the enhancement requested

#38704 implemented Move on hierarchical namespace storage accounts but we sill need to implement it for flat namespace storage accounts.

Supporting both file and directory moves would be nice but to achieve parity with GCS and S3 filesystems we only need to support file moves.

This is a child of #18014.

Component(s)

C++

@Tom-Newton
Copy link
Contributor Author

Tom-Newton commented Feb 10, 2024

cc @felipecrv - you mentioned that you planned to work on this.

Assuming #40021 is completed first, then this ticket should include enabling the python side tests for move. I will add TODO(GH-40025) comments where relevant.

@felipecrv
Copy link
Contributor

cc @felipecrv - you mentioned that you planned to work on this.

Assuming #40021 is completed first, then this ticket should include enabling the python side tests for move. I will add TODO(GH-40025) comments where relevant.

If you can pass the tests on a real HNS storage account, it should be the same test code running against a different environment. The semantics should be the same. I don't plan to add new test cases in azurefs_test.cc for Move() on flat namespace accounts because the test code is supposed to run the same way on these different environments.

@Tom-Newton
Copy link
Contributor Author

Tom-Newton commented Feb 12, 2024

Assuming #40021 is completed first, then this ticket should include enabling the python side tests for move. I will add TODO(GH-40025) comments where relevant.

If you can pass the tests on a real HNS storage account, it should be the same test code running against a different environment. The semantics should be the same. I don't plan to add new test cases in azurefs_test.cc for Move() on flat namespace accounts because the test code is supposed to run the same way on these different environments.

That all makes sense but I think we are talking cross purposes. I'm not proposing we add new tests. The python tests are generic across all filesystems and #40021 just enables them to run against azurite. However it currently it disables a couple of tests because move is not supported on azurite. I don't think its worth running the python tests against real blob storage.

My point is that the disabled tests should be re-enabled when we add support for move on flat namespace

@felipecrv
Copy link
Contributor

I don't think its worth running the python tests against real blob storage.

I would be grateful if you ran these tests manually against a real storage account with HNS support to catch any issues with the tests themselves. If they pass on the HNS implementation with Move() I can more easily take care of any issue that arises when we run it against flat-namespace Azurite.

@Tom-Newton
Copy link
Contributor Author

I would be grateful if you ran these tests manually against a real storage account with HNS support to catch any issues with the tests themselves.

Turns out this was definitely a good idea. I was expecting everything to pass but there where actually a lot of failures

FAILED pyarrow/tests/test_fs.py::test_filesystem_is_functional_after_pickling[builtin_pickle-AzureFileSystem] - OSError: Failed to create directory '156d07e6-c9d2-11ee-989a-71cec6336ac8/a/aa/aaa/': https://tomtesthns.blob.core.windows.net/156d07e6-c9d2-11ee-989a-71cec6336ac8 Azure Error: [InvalidUri] 400 The request URI is invalid.
FAILED pyarrow/tests/test_fs.py::test_filesystem_is_functional_after_pickling[cloudpickle-AzureFileSystem] - OSError: Failed to create directory '165e703f-c9d2-11ee-989a-71cec6336ac8/a/aa/aaa/': https://tomtesthns.blob.core.windows.net/165e703f-c9d2-11ee-989a-71cec6336ac8 Azure Error: [InvalidUri] 400 The request URI is invalid.
FAILED pyarrow/tests/test_fs.py::test_get_file_info[AzureFileSystem] - OSError: Failed to create directory '165e7042-c9d2-11ee-989a-71cec6336ac8/a/aa/aaa/': https://tomtesthns.blob.core.windows.net/165e7042-c9d2-11ee-989a-71cec6336ac8 Azure Error: [InvalidUri] 400 The request URI is invalid.
FAILED pyarrow/tests/test_fs.py::test_get_file_info_with_selector[AzureFileSystem] - OSError: Failed to delete a directory: selector-dir/: https://tomtesthns.blob.core.windows.net/17d5de34-c9d2-11ee-989a-71cec6336ac8/selector-dir/ Azure Error: [InvalidUri] 400 The request URI is invalid.
FAILED pyarrow/tests/test_fs.py::test_create_dir[AzureFileSystem] - OSError: Failed to create directory '17d5de35-c9d2-11ee-989a-71cec6336ac8/test-directory/': https://tomtesthns.blob.core.windows.net/17d5de35-c9d2-11ee-989a-71cec6336ac8 Azure Error: [InvalidUri] 400 The request URI is invalid.
FAILED pyarrow/tests/test_fs.py::test_delete_dir[AzureFileSystem] - OSError: Failed to create directory '1925cd58-c9d2-11ee-989a-71cec6336ac8/directory/nested/': https://tomtesthns.blob.core.windows.net/1925cd58-c9d2-11ee-989a-71cec6336ac8 Azure Error: [InvalidUri] 400 The request URI is invalid.
FAILED pyarrow/tests/test_fs.py::test_delete_dir_with_explicit_subdir[AzureFileSystem] - OSError: Failed to create directory '1925cd59-c9d2-11ee-989a-71cec6336ac8/directory/': https://tomtesthns.blob.core.windows.net/1925cd59-c9d2-11ee-989a-71cec6336ac8 Azure Error: [InvalidUri] 400 The request URI is invalid.
FAILED pyarrow/tests/test_fs.py::test_delete_dir_contents[AzureFileSystem] - OSError: Failed to create directory '1a1e807e-c9d2-11ee-989a-71cec6336ac8/directory/nested/': https://tomtesthns.blob.core.windows.net/1a1e807e-c9d2-11ee-989a-71cec6336ac8 Azure Error: [InvalidUri] 400 The request URI is invalid.
FAILED pyarrow/tests/test_fs.py::test_copy_file[AzureFileSystem] - OSError: Failed to copy a blob. (https://tomtesthns.blob.core.windows.net/1a1e807f-c9d2-11ee-989a-71cec6336ac8/test-copy-source-file -> https://tomtesthns.blob.core.windows.net/1a1e807f-c9d2-11ee-989a-71cec6336ac8/test-copy-target-file) Azure Error: [CannotVerifyCopySource] 401 Server ...
FAILED pyarrow/tests/test_fs.py::test_move_directory[AzureFileSystem] - OSError: Failed to create directory '1b30faf0-c9d2-11ee-989a-71cec6336ac8/source-dir/': https://tomtesthns.blob.core.windows.net/1b30faf0-c9d2-11ee-989a-71cec6336ac8 Azure Error: [InvalidUri] 400 The request URI is invalid.
FAILED pyarrow/tests/test_fs.py::test_move_file[AzureFileSystem] - OSError: Failed to delete a file: test-move-target-file: https://tomtesthns.blob.core.windows.net/1b30faf1-c9d2-11ee-989a-71cec6336ac8/test-move-target-file Azure Error: [LeaseIdMissing] 412 There is currently a lease on the resource and no lease ID was specified in the request.
FAILED pyarrow/tests/test_fs.py::test_open_output_stream_metadata[AzureFileSystem] - OSError: CommitBlockList failed for 'https://tomtesthns.blob.core.windows.net/283b1457-c9d2-11ee-989a-71cec6336ac8/open-output-stream-metadata'. Committing is required to flush an output/append stream. Azure Error: [InvalidMetadata] 400 The metadata specified is invalid. It has charact...

Most of them are failed to create directory which I think boils down to fs->CreateDir("directory/") failing on hierarchical namespace. I will make an issue and PR about this case.

@Tom-Newton
Copy link
Contributor Author

After sorting the obvious ones (#40052). The move related failures are

FAILED pyarrow/tests/test_fs.py::test_copy_file[AzureFileSystem]

E   OSError: Failed to copy a blob. (https://tomtesthns.blob.core.windows.net/de8a4b50-c9d7-11ee-989a-71cec6336ac8/test-copy-source-file -> https://tomtesthns.blob.core.windows.net/de8a4b50-c9d7-11ee-989a-71cec6336ac8/test-copy-target-file) Azure Error: [CannotVerifyCopySource] 401 Server failed to authenticate the request. Please refer to the information in the www-authenticate header.
E   Server failed to authenticate the request. Please refer to the information in the www-authenticate header.
E   RequestId:00a565a9-001e-0044-6ce4-5d5550000000
E   Time:2024-02-12T18:52:28.9014466Z
E   Request ID: 00a565a9-001e-0044-6ce4-5d5550000000

FAILED pyarrow/tests/test_fs.py::test_move_directory[AzureFileSystem]

E   OSError: Failed to delete a directory: target-dir/: https://tomtesthns.blob.core.windows.net/de8a4b51-c9d7-11ee-989a-71cec6336ac8/target-dir Azure Error: [LeaseIdMissing] 412 There is currently a lease on the resource and no lease ID was specified in the request.
E   There is currently a lease on the resource and no lease ID was specified in the request.
E   RequestId:a33fdb5a-a01f-0062-76e4-5d1d48000000
E   Time:2024-02-12T18:52:31.3655532Z
E   Request ID: a33fdb5a-a01f-0062-76e4-5d1d48000000

FAILED pyarrow/tests/test_fs.py::test_move_file[AzureFileSystem]

E   OSError: Failed to delete a directory: target-dir/: https://tomtesthns.blob.core.windows.net/de8a4b51-c9d7-11ee-989a-71cec6336ac8/target-dir Azure Error: [LeaseIdMissing] 412 There is currently a lease on the resource and no lease ID was specified in the request.
E   There is currently a lease on the resource and no lease ID was specified in the request.
E   RequestId:a33fdb5a-a01f-0062-76e4-5d1d48000000
E   Time:2024-02-12T18:52:31.3655532Z
E   Request ID: a33fdb5a-a01f-0062-76e4-5d1d48000000

@felipecrv
Copy link
Contributor

@av8or1 here is the link to the WIP PR #40203

jorisvandenbossche added a commit that referenced this issue Mar 13, 2024
…ystem` (#40021)

### Rationale for this change
We want to use the new `AzureFileSystem` in `pyarrow`. 

### What changes are included in this PR?
- Add minimal python bindings for `AzureFileSystem`. This includes just enough to run the python tests against azurite plus default credential auth to enable real use of this once this PR merges. 
- Adding additional configuration options and remaining authentication options can be done as a follow up. 
- I tried to copy the existing pybinds for GCS and S3
- Explicitly set `ARROW_AZURE=OFF` rather than relying on defaults. The defaults are different for builds vs tests so this was causing tests to be enabled while Azure was disabled during the build.

### Are these changes tested?
Enabled the the python filesystem tests for the new filesystem. I had to skip azure in a couple of the tests though because they are not yet working on the C++ side. I created Github issues to resolve these #40025 and #40026 and added TODO comments where relevant, that reference these Github issues. 

### Are there any user-facing changes?
`pyarrow` users can now use the native `AzureFileSystem` to get much better reliability and performance compared to `adlfs` based options. 

* Closes: #39968
* GitHub Issue: #39968

Lead-authored-by: Thomas Newton <thomas.w.newton@gmail.com>
Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants