[DOCS] Add docs for Azure IO #1851

Merged 2 commits on Feb 8, 2024
1 change: 1 addition & 0 deletions docs/source/user_guide/integrations.rst
@@ -4,3 +4,4 @@ Integrations
.. toctree::

integrations/iceberg
integrations/microsoft-azure
64 changes: 64 additions & 0 deletions docs/source/user_guide/integrations/microsoft-azure.rst
@@ -0,0 +1,64 @@
Microsoft Azure
===============

Daft can read and write data to and from Azure Blob Store, and natively understands the URL protocols ``az://`` and ``abfs://`` as referring to data that resides
in Azure Blob Store.

.. WARNING::

Daft currently only supports globbing and listing files in storage accounts with `hierarchical namespaces <https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-namespace>`_ enabled.

Hierarchical namespaces enable Daft to use its embarrassingly parallel globbing algorithm to improve performance of listing large nested directories of data.

Please file an issue if you need support for non-hierarchical namespace buckets! We'd love to support your use-case.

Authorization/Authentication
----------------------------

In Azure Blob Store, data is stored under the following hierarchy:

1. Storage Account
2. Container (sometimes referred to as "bucket" in S3-based services)
3. Object Key

URLs to data in Azure Blob Store come in the form: ``az://{CONTAINER_NAME}/{OBJECT_KEY}``.

Since the Storage Account is not part of the URL, you must provide it separately.
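
As a concrete sketch (the container, object key, and storage account name below are hypothetical), the URL supplies the Container and Object Key, while the Storage Account is supplied through an :class:`daft.io.AzureConfig`:

.. code:: python

    from daft.io import IOConfig, AzureConfig

    # "my_container" is the Container and "my_path/my_file.parquet" is the Object Key
    url = "az://my_container/my_path/my_file.parquet"

    # The Storage Account does not appear in the URL, so it is provided separately
    io_config = IOConfig(azure=AzureConfig(storage_account="my_storage_account"))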

Rely on Environment
*******************

You can rely on Azure's `environment variables <https://learn.microsoft.com/en-us/azure/storage/blobs/authorize-data-operations-cli#set-environment-variables-for-authorization-parameters>`_
to have Daft automatically discover credentials.
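
As a minimal sketch, assuming credentials are exposed through environment variables such as ``AZURE_STORAGE_ACCOUNT`` and ``AZURE_STORAGE_KEY`` (described in the linked Azure documentation), no explicit credentials need to be passed to Daft:

.. code:: python

    import daft

    # Credentials are discovered from the environment, so no IOConfig is passed here
    df = daft.read_parquet("az://my_container/my_path/**/*")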

Please be aware that when doing so in a distributed environment such as Ray, Daft will pick these credentials up from the worker machines, so each worker machine needs to be appropriately provisioned.

If you instead want Daft to use credentials from the "driver", you may wish to manually specify your credentials.

Manually specify credentials
****************************

You may also choose to pass these values into your Daft I/O function calls using an :class:`daft.io.AzureConfig` config object.

:func:`daft.set_planning_config` is a convenient way to set your :class:`daft.io.IOConfig` as the default config to use on any subsequent Daft method calls.

.. code:: python

    import daft
    from daft.io import IOConfig, AzureConfig

    # Supply actual values for the storage_account and access_key here
    io_config = IOConfig(azure=AzureConfig(storage_account="***", access_key="***"))

    # Globally set the default IOConfig for any subsequent I/O calls
    daft.set_planning_config(default_io_config=io_config)

    # Perform some I/O operation
    df = daft.read_parquet("az://my_container/my_path/**/*")

Alternatively, Daft supports overriding the default IOConfig per operation by passing it into the ``io_config=`` keyword argument. This is extremely flexible, as you can
pass a different :class:`daft.io.AzureConfig` per function call if you wish!

.. code:: python

    # Perform some I/O operation but override the IOConfig
    df2 = daft.read_csv("az://my_container/my_other_path/**/*", io_config=io_config)
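
Writing works analogously; as a sketch (the output path is hypothetical), you can write a DataFrame back to Azure Blob Store and pass the same ``io_config``:

.. code:: python

    # Write the DataFrame back out to Azure Blob Store, overriding the IOConfig for this write
    df2.write_parquet("az://my_container/my_output_path/", io_config=io_config)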