
GH-39968: [Python][FS][Azure] Minimal Python bindings for AzureFileSystem #40021

Merged
Changes from all commits (36 commits)
c520a1d
Check in patch file of python bindings from my custom build
Tom-Newton Feb 7, 2024
aa0a1b8
Complete bindings
Tom-Newton Feb 7, 2024
ffd1534
Python side boilerplate
Tom-Newton Feb 7, 2024
3cdbc2e
Start tests
Tom-Newton Feb 8, 2024
46d9135
Update patch
Tom-Newton Feb 8, 2024
e51af7b
Working build with azure bindings
Tom-Newton Feb 8, 2024
6f6cf33
Fix account_name configuration
Tom-Newton Feb 8, 2024
6321937
Sufficient pybinds to connect to azurite
Tom-Newton Feb 8, 2024
dedeea8
Somewhat working azurite tests
Tom-Newton Feb 8, 2024
f7c650a
Skip move tests which are not yet implemented
Tom-Newton Feb 9, 2024
995d787
Update skipped tests
Tom-Newton Feb 9, 2024
7af5a42
Working pickling tests
Tom-Newton Feb 10, 2024
85d9de3
Update TODO comments with references to relevant Github issues
Tom-Newton Feb 10, 2024
9bb2c1b
Add test_azurefs_options
Tom-Newton Feb 10, 2024
4641fc7
Tidy azure_server
Tom-Newton Feb 10, 2024
9d11166
A bit of alphabetical ordering
Tom-Newton Feb 10, 2024
2b68dff
ARchery lint
Tom-Newton Feb 10, 2024
c5cb74a
C++ autoformat
Tom-Newton Feb 10, 2024
d9cf2ea
Update cpp/src/arrow/util/config.h.cmake
Tom-Newton Feb 10, 2024
26a4632
Update python/pyarrow/_azurefs.pyx
Tom-Newton Feb 11, 2024
cb0aefd
Write a docstring
Tom-Newton Feb 12, 2024
02a2233
Update docstring to mention `/` as only supported delimiter
Tom-Newton Feb 12, 2024
ca97370
Capitalise Azurite
Tom-Newton Feb 12, 2024
744b119
Update docstring with PR comments
Tom-Newton Feb 13, 2024
fc0940a
Add comment about azurite credentials
Tom-Newton Feb 13, 2024
6b83cca
Enable Azure tests whenever AzureFileSystem can be imported
Tom-Newton Feb 13, 2024
7c84225
Docstring correction
Tom-Newton Feb 14, 2024
b8ae75a
Move account_name and account_key to `azure_server` fixture
Tom-Newton Feb 14, 2024
3d7717a
Only run blob emulator not queue or table
Tom-Newton Feb 19, 2024
0174285
Mention azurite support in docstring
Tom-Newton Feb 19, 2024
514c5dd
Set allow_move_dir=True on Azure tests
Tom-Newton Feb 22, 2024
30fae58
Spelling
Tom-Newton Feb 27, 2024
4801933
Re-order list
Tom-Newton Feb 27, 2024
0bb64bf
Use skip instead of xfail
Tom-Newton Feb 27, 2024
7ad9287
Re-order list
Tom-Newton Feb 28, 2024
20e7a31
Explicitly set ARROW_AZURE=OFF instead of leaving default
Tom-Newton Feb 29, 2024
1 change: 1 addition & 0 deletions ci/docker/alpine-linux-3.16-cpp.dockerfile
@@ -74,6 +74,7 @@ COPY ci/scripts/install_gcs_testbench.sh /arrow/ci/scripts/
RUN /arrow/ci/scripts/install_gcs_testbench.sh default

ENV ARROW_ACERO=ON \
ARROW_AZURE=OFF \
ARROW_BUILD_TESTS=ON \
ARROW_DATASET=ON \
ARROW_DEPENDENCY_SOURCE=SYSTEM \
1 change: 1 addition & 0 deletions ci/docker/fedora-39-cpp.dockerfile
@@ -80,6 +80,7 @@ RUN /arrow/ci/scripts/install_sccache.sh unknown-linux-musl /usr/local/bin
# Python process explicitly if we use LLVM 17 or later.
ENV absl_SOURCE=BUNDLED \
ARROW_ACERO=ON \
ARROW_AZURE=OFF \
ARROW_BUILD_TESTS=ON \
ARROW_DEPENDENCY_SOURCE=SYSTEM \
ARROW_DATASET=ON \
1 change: 1 addition & 0 deletions ci/docker/linux-apt-docs.dockerfile
@@ -96,6 +96,7 @@ RUN /arrow/ci/scripts/r_deps.sh /arrow && \
R -e "install.packages('pkgdown')"

ENV ARROW_ACERO=ON \
ARROW_AZURE=OFF \
ARROW_BUILD_STATIC=OFF \
ARROW_BUILD_TESTS=OFF \
ARROW_BUILD_UTILITIES=OFF \
1 change: 1 addition & 0 deletions ci/docker/ubuntu-20.04-cpp-minimal.dockerfile
@@ -76,6 +76,7 @@ COPY ci/scripts/install_sccache.sh /arrow/ci/scripts/
RUN /arrow/ci/scripts/install_sccache.sh unknown-linux-musl /usr/local/bin

ENV ARROW_ACERO=ON \
ARROW_AZURE=OFF \
ARROW_BUILD_TESTS=ON \
ARROW_DATASET=ON \
ARROW_FLIGHT=ON \
1 change: 1 addition & 0 deletions ci/docker/ubuntu-22.04-cpp-minimal.dockerfile
@@ -76,6 +76,7 @@ COPY ci/scripts/install_sccache.sh /arrow/ci/scripts/
RUN /arrow/ci/scripts/install_sccache.sh unknown-linux-musl /usr/local/bin

ENV ARROW_ACERO=ON \
ARROW_AZURE=OFF \
ARROW_BUILD_TESTS=ON \
ARROW_DATASET=ON \
ARROW_FLIGHT=ON \
5 changes: 4 additions & 1 deletion cpp/src/arrow/filesystem/api.h
@@ -20,10 +20,13 @@
#include "arrow/util/config.h" // IWYU pragma: export

#include "arrow/filesystem/filesystem.h" // IWYU pragma: export
#include "arrow/filesystem/hdfs.h" // IWYU pragma: export
#ifdef ARROW_AZURE
#include "arrow/filesystem/azurefs.h" // IWYU pragma: export
#endif
#ifdef ARROW_GCS
#include "arrow/filesystem/gcsfs.h" // IWYU pragma: export
#endif
#include "arrow/filesystem/hdfs.h" // IWYU pragma: export
#include "arrow/filesystem/localfs.h" // IWYU pragma: export
#include "arrow/filesystem/mockfs.h" // IWYU pragma: export
#ifdef ARROW_S3
2 changes: 2 additions & 0 deletions cpp/src/arrow/filesystem/azurefs_test.cc
@@ -2458,6 +2458,7 @@ TEST_F(TestAzuriteFileSystem, WriteMetadata) {
ASSERT_OK(output->Close());

// Verify the metadata has been set.
// TODO(GH-40025): Use `AzureFileSystem` to fetch metadata for this assertion.
auto blob_metadata = blob_service_client_->GetBlobContainerClient(data.container_name)
.GetBlockBlobClient(blob_path)
.GetProperties()
@@ -2470,6 +2471,7 @@ TEST_F(TestAzuriteFileSystem, WriteMetadata) {
full_path, /*metadata=*/arrow::key_value_metadata({{"bar", "foo"}})));
ASSERT_OK(output->Write(expected));
ASSERT_OK(output->Close());
// TODO(GH-40025): Use `AzureFileSystem` to fetch metadata for this assertion.
blob_metadata = blob_service_client_->GetBlobContainerClient(data.container_name)
.GetBlockBlobClient(blob_path)
.GetProperties()
7 changes: 4 additions & 3 deletions cpp/src/arrow/filesystem/type_fwd.h
@@ -42,11 +42,12 @@ struct FileInfo;
struct FileSelector;

class FileSystem;
class SubTreeFileSystem;
class SlowFileSystem;
class AzureFileSystem;
class GcsFileSystem;
class LocalFileSystem;
class S3FileSystem;
class GcsFileSystem;
class SlowFileSystem;
class SubTreeFileSystem;

} // namespace fs
} // namespace arrow
1 change: 1 addition & 0 deletions cpp/src/arrow/util/config.h.cmake
@@ -52,6 +52,7 @@
#cmakedefine ARROW_PARQUET
#cmakedefine ARROW_SUBSTRAIT

#cmakedefine ARROW_AZURE
#cmakedefine ARROW_ENABLE_THREADING
#cmakedefine ARROW_GCS
#cmakedefine ARROW_HDFS
4 changes: 4 additions & 0 deletions python/CMakeLists.txt
@@ -554,6 +554,10 @@ set_source_files_properties(pyarrow/lib.pyx PROPERTIES CYTHON_API TRUE)

set(LINK_LIBS arrow_python)

if(PYARROW_BUILD_AZURE)
list(APPEND CYTHON_EXTENSIONS _azurefs)
endif()

if(PYARROW_BUILD_GCS)
list(APPEND CYTHON_EXTENSIONS _gcsfs)
endif()
3 changes: 2 additions & 1 deletion python/pyarrow/__init__.py
@@ -145,7 +145,8 @@ def print_entry(label, value):
print(f" {module: <20}: {status: <8}")

print("\nFilesystems:")
filesystems = ["GcsFileSystem", "HadoopFileSystem", "S3FileSystem"]
filesystems = ["AzureFileSystem", "GcsFileSystem",
"HadoopFileSystem", "S3FileSystem"]
for fs in filesystems:
status = "Enabled" if _filesystem_is_available(fs) else "-"
print(f" {fs: <20}: {status: <8}")
134 changes: 134 additions & 0 deletions python/pyarrow/_azurefs.pyx
@@ -0,0 +1,134 @@
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

# cython: language_level = 3

from cython cimport binding


from pyarrow.lib import frombytes, tobytes
from pyarrow.includes.libarrow_fs cimport *
from pyarrow._fs cimport FileSystem


cdef class AzureFileSystem(FileSystem):
"""
Azure Blob Storage backed FileSystem implementation

This implementation supports flat namespace and hierarchical namespace (HNS) a.k.a.
Data Lake Gen2 storage accounts. HNS will be automatically detected and HNS specific
features will be used when they provide a performance advantage. Azurite emulator is
also supported. Note: `/` is the only supported delimiter.

The storage account is considered the root of the filesystem. When enabled, containers
will be created or deleted during relevant directory operations. Obviously, this also
requires authentication with the additional permissions.

By default `DefaultAzureCredential <https://github.com/Azure/azure-sdk-for-cpp/blob/main/sdk/identity/azure-identity/README.md#defaultazurecredential>`__
is used for authentication. This means it will try several types of authentication
and go with the first one that works. If any authentication parameters are provided when
initialising the FileSystem, they will be used instead of the default credential.

Parameters
----------
account_name : str
Azure Blob Storage account name. This is the globally unique identifier for the
storage account.
account_key : str, default None
Account key of the storage account. Pass None to use default credential.
blob_storage_authority : str, default None
hostname[:port] of the Blob Service. Defaults to `.blob.core.windows.net`. Useful
for connecting to a local emulator, like Azurite.
dfs_storage_authority : str, default None
hostname[:port] of the Data Lake Gen 2 Service. Defaults to
`.dfs.core.windows.net`. Useful for connecting to a local emulator, like Azurite.
blob_storage_scheme : str, default None
Either `http` or `https`. Defaults to `https`. Useful for connecting to a local
emulator, like Azurite.
dfs_storage_scheme : str, default None
Either `http` or `https`. Defaults to `https`. Useful for connecting to a local
emulator, like Azurite.
Comment on lines +59 to +64

Contributor: @kou should this also change to enable_tls like you did in the URI parsing? cc @Tom-Newton

Member: Ah, I think so. We may want to use AzureOptions::FromUri() instead of re-implementing the same logic. @Tom-Newton Could you follow-up this?

Contributor Author (@Tom-Newton, Mar 15, 2024): I created the issue for completing the python bindings and referenced this conversation #40572. There is a good chance that I will work on it but I can't say when.
Examples
--------
>>> from pyarrow import fs
>>> azure_fs = fs.AzureFileSystem(account_name='myaccount')
>>> azurite_fs = fs.AzureFileSystem(
... account_name='devstoreaccount1',
... account_key='Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==',
... blob_storage_authority='127.0.0.1:10000',
... dfs_storage_authority='127.0.0.1:10000',
... blob_storage_scheme='http',
... dfs_storage_scheme='http',
... )

For usage of the methods see examples for :func:`~pyarrow.fs.LocalFileSystem`.
"""
cdef:
CAzureFileSystem* azurefs
c_string account_key

def __init__(self, account_name, *, account_key=None, blob_storage_authority=None,
dfs_storage_authority=None, blob_storage_scheme=None,
dfs_storage_scheme=None):
cdef:
CAzureOptions options
shared_ptr[CAzureFileSystem] wrapped

options.account_name = tobytes(account_name)
if blob_storage_authority:
options.blob_storage_authority = tobytes(blob_storage_authority)
if dfs_storage_authority:
options.dfs_storage_authority = tobytes(dfs_storage_authority)
if blob_storage_scheme:
options.blob_storage_scheme = tobytes(blob_storage_scheme)
if dfs_storage_scheme:
options.dfs_storage_scheme = tobytes(dfs_storage_scheme)

if account_key:
options.ConfigureAccountKeyCredential(tobytes(account_key))
self.account_key = tobytes(account_key)
else:
options.ConfigureDefaultCredential()
Comment on lines +103 to +106

Member: Can the Configure methods on AzureOptions optionally do IO? If so, they should probably be called without the GIL.

Contributor Author: I don't think it does any IO. I think the tests actually prove this. These configure the filesystem for connection to a storage account that doesn't exist. If they were doing any IO I think these tests would fail.

Member (@pitrou, Feb 27, 2024): Ok, thanks. I was asking because S3 does, annoyingly.

Contributor: All the error handling really happens in AzureOptions::Make(Blob|DataLake)ServiceClient()
with nogil:
wrapped = GetResultValue(CAzureFileSystem.Make(options))

self.init(<shared_ptr[CFileSystem]> wrapped)

cdef init(self, const shared_ptr[CFileSystem]& wrapped):
FileSystem.init(self, wrapped)
self.azurefs = <CAzureFileSystem*> wrapped.get()

@staticmethod
@binding(True) # Required for cython < 3
def _reconstruct(kwargs):
# __reduce__ doesn't allow passing named arguments directly to the
# reconstructor, hence this wrapper.
return AzureFileSystem(**kwargs)

def __reduce__(self):
cdef CAzureOptions opts = self.azurefs.options()
return (
AzureFileSystem._reconstruct, (dict(
account_name=frombytes(opts.account_name),
account_key=frombytes(self.account_key),
blob_storage_authority=frombytes(opts.blob_storage_authority),
dfs_storage_authority=frombytes(opts.dfs_storage_authority),
blob_storage_scheme=frombytes(opts.blob_storage_scheme),
dfs_storage_scheme=frombytes(opts.dfs_storage_scheme)
),))
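The `_reconstruct` indirection above exists because `__reduce__` can only pass positional arguments to its reconstructor, while `AzureFileSystem.__init__` takes keyword-only arguments. The same pattern can be sketched in pure Python, independent of pyarrow (the class and field names below are illustrative):

```python
import pickle


class KwargsPicklable:
    """Toy object that pickles via a keyword-argument reconstructor."""

    def __init__(self, account_name, *, account_key=None):
        self.account_name = account_name
        self.account_key = account_key

    @staticmethod
    def _reconstruct(kwargs):
        # __reduce__ doesn't allow passing named arguments directly to
        # the reconstructor, hence this wrapper that re-expands them.
        return KwargsPicklable(**kwargs)

    def __reduce__(self):
        return (KwargsPicklable._reconstruct,
                (dict(account_name=self.account_name,
                      account_key=self.account_key),))


fs = KwargsPicklable("devstoreaccount1", account_key="secret")
restored = pickle.loads(pickle.dumps(fs))
```

This is also why the test suite includes dedicated pickling tests ("Working pickling tests" in the commit list): round-tripping through `pickle` exercises exactly this reconstruction path.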
3 changes: 3 additions & 0 deletions python/pyarrow/_fs.pyx
@@ -491,6 +491,9 @@ cdef class FileSystem(_Weakrefable):
elif typ == 'gcs':
from pyarrow._gcsfs import GcsFileSystem
self = GcsFileSystem.__new__(GcsFileSystem)
elif typ == 'abfs':
from pyarrow._azurefs import AzureFileSystem
self = AzureFileSystem.__new__(AzureFileSystem)
elif typ == 'hdfs':
from pyarrow._hdfs import HadoopFileSystem
self = HadoopFileSystem.__new__(HadoopFileSystem)
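The new `'abfs'` branch lets `FileSystem.from_uri` route Azure URIs to the right class. Scheme-based dispatch of this kind can be sketched as follows (the registry contents and function name here are illustrative, not pyarrow's actual implementation):

```python
from urllib.parse import urlparse

# Illustrative scheme -> filesystem-class registry, mirroring the
# elif chain added to _fs.pyx above.
_FILESYSTEM_SCHEMES = {
    "gs": "GcsFileSystem",
    "abfs": "AzureFileSystem",
    "hdfs": "HadoopFileSystem",
    "s3": "S3FileSystem",
}


def filesystem_class_for_uri(uri):
    """Return the filesystem class name the URI scheme maps to."""
    scheme = urlparse(uri).scheme
    try:
        return _FILESYSTEM_SCHEMES[scheme]
    except KeyError:
        raise ValueError(f"Unrecognized filesystem scheme: {scheme!r}")


name = filesystem_class_for_uri("abfs://container/path")  # "AzureFileSystem"
```

Each filesystem class is imported lazily inside its branch, so a pyarrow build without Azure support only fails when an `abfs://` URI is actually requested.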
9 changes: 8 additions & 1 deletion python/pyarrow/conftest.py
@@ -24,6 +24,7 @@

groups = [
'acero',
'azure',
'brotli',
'bz2',
'cython',
@@ -54,6 +55,7 @@

defaults = {
'acero': False,
'azure': False,
'brotli': Codec.is_available('brotli'),
'bz2': Codec.is_available('bz2'),
'cython': False,
@@ -142,13 +144,18 @@
except ImportError:
pass

try:
from pyarrow.fs import AzureFileSystem # noqa
defaults['azure'] = True
except ImportError:
pass

try:
from pyarrow.fs import GcsFileSystem # noqa
defaults['gcs'] = True
except ImportError:
pass


try:
from pyarrow.fs import S3FileSystem # noqa
defaults['s3'] = True
4 changes: 4 additions & 0 deletions python/pyarrow/fs.py
@@ -39,6 +39,10 @@
FileStats = FileInfo

_not_imported = []
try:
from pyarrow._azurefs import AzureFileSystem # noqa
except ImportError:
_not_imported.append("AzureFileSystem")

try:
from pyarrow._hdfs import HadoopFileSystem # noqa
16 changes: 16 additions & 0 deletions python/pyarrow/includes/libarrow_fs.pxd
@@ -251,6 +251,22 @@ cdef extern from "arrow/filesystem/api.h" namespace "arrow::fs" nogil:
CResult[shared_ptr[CGcsFileSystem]] Make(const CGcsOptions& options)
CGcsOptions options()

cdef cppclass CAzureOptions "arrow::fs::AzureOptions":
c_string account_name
c_string blob_storage_authority
c_string dfs_storage_authority
c_string blob_storage_scheme
c_string dfs_storage_scheme

c_bool Equals(const CAzureOptions& other)
CStatus ConfigureDefaultCredential()
CStatus ConfigureAccountKeyCredential(c_string account_key)

cdef cppclass CAzureFileSystem "arrow::fs::AzureFileSystem":
@staticmethod
CResult[shared_ptr[CAzureFileSystem]] Make(const CAzureOptions& options)
CAzureOptions options()

cdef cppclass CHdfsOptions "arrow::fs::HdfsOptions":
HdfsConnectionConfig connection_config
int32_t buffer_size
31 changes: 31 additions & 0 deletions python/pyarrow/tests/conftest.py
@@ -256,6 +256,37 @@ def gcs_server():
proc.wait()


@pytest.fixture(scope='session')
def azure_server(tmpdir_factory):
port = find_free_port()
env = os.environ.copy()
tmpdir = tmpdir_factory.getbasetemp()
# We only need blob service emulator, not queue or table.
args = ['azurite-blob', "--location", tmpdir, "--blobPort", str(port)]
proc = None
try:
proc = subprocess.Popen(args, env=env)
# Make sure the server is alive.
if proc.poll() is not None:
pytest.skip(f"Command {args} did not start server successfully!")
except (ModuleNotFoundError, OSError) as e:
pytest.skip(f"Command {args} failed to execute: {e}")
else:
yield {
# Use the standard azurite account_name and account_key.
# https://learn.microsoft.com/en-us/azure/storage/common/storage-use-emulator#authorize-with-shared-key-credentials
'connection': ('127.0.0.1', port, 'devstoreaccount1',
'Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2'
'UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw=='),
'process': proc,
'tempdir': tmpdir,
}
finally:
if proc is not None:
proc.kill()
proc.wait()
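The fixture relies on a `find_free_port` helper defined elsewhere in the test utilities. A common implementation, sketched here as an assumption about how it works, binds port 0 so the OS picks an ephemeral port, then releases it for Azurite to claim:

```python
import socket


def find_free_port():
    """Ask the OS for a currently-free ephemeral port, then release it.

    There is a small race window between closing the socket and the
    server binding the port, which is usually acceptable in tests.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))  # port 0 -> OS assigns a free port
        return sock.getsockname()[1]


port = find_free_port()
```

Allocating the port dynamically lets multiple test sessions run `azurite-blob` concurrently without colliding on the default port 10000.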


@pytest.fixture(
params=[
'builtin_pickle',