[DOCS] Add docs for Azure IO (#1851)

Adds some starter documentation for using Daft with Azure services

Co-authored-by: Jay Chia <jaychia94@gmail.com@users.noreply.github.com>
Eventual-Inc · Feb 8, 2024 · 1fae7ad
1 parent f74012d
commit 1fae7ad
Showing 2 changed files with 65 additions and 0 deletions.
diff --git a/docs/source/user_guide/integrations.rst b/docs/source/user_guide/integrations.rst
@@ -4,3 +4,4 @@ Integrations
 .. toctree::
 
     integrations/iceberg
+    integrations/microsoft-azure
diff --git a/docs/source/user_guide/integrations/microsoft-azure.rst b/docs/source/user_guide/integrations/microsoft-azure.rst
@@ -0,0 +1,64 @@
+Microsoft Azure
+===============
+
+Daft can read data from and write data to Azure Blob Storage, and natively understands the URL protocols ``az://`` and ``abfs://`` as referring to data that resides
+in Azure Blob Storage.
+
+.. WARNING::
+
+    Daft currently only supports globbing and listing files in storage accounts with `hierarchical namespaces <https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-namespace>`_ enabled.
+
+    Hierarchical namespaces enable Daft to use its embarrassingly parallel globbing algorithm to improve performance of listing large nested directories of data.
+
+    Please file an issue if you need support for storage accounts without hierarchical namespaces! We'd love to support your use-case.
+
+Authorization/Authentication
+----------------------------
+
+In Azure Blob Storage, data is organized in the following hierarchy:
+
+1. Storage Account
+2. Container (sometimes referred to as "bucket" in S3-based services)
+3. Object Key
+
+URLs to data in Azure Blob Storage come in the form: ``az://{CONTAINER_NAME}/{OBJECT_KEY}``.
+
+Because the Storage Account is not part of the URL, you must provide it separately.
+
+Rely on Environment
+*******************
+
+You can rely on Azure's `environment variables <https://learn.microsoft.com/en-us/azure/storage/blobs/authorize-data-operations-cli#set-environment-variables-for-authorization-parameters>`_
+to have Daft automatically discover credentials.
+
+Be aware that in a distributed environment such as Ray, Daft picks these credentials up on the worker machines, so each worker machine must be provisioned with them.
+
+If you instead want Daft to use credentials from the "driver", specify your credentials manually.
+
+Manually specify credentials
+****************************
+
+You may also choose to pass these values into your Daft I/O function calls using an :class:`daft.io.AzureConfig` config object.
+
+:func:`daft.set_planning_config` is a convenient way to set your :class:`daft.io.IOConfig` as the default config to use on any subsequent Daft method calls.
+
+.. code:: python
+
+    import daft
+    from daft.io import IOConfig, AzureConfig
+
+    io_config = IOConfig(azure=AzureConfig(storage_account="***", access_key="***"))  # replace "***" with real values
+
+    # Globally set the default IOConfig for any subsequent I/O calls
+    daft.set_planning_config(default_io_config=io_config)
+
+    # Perform some I/O operation
+    df = daft.read_parquet("az://my_container/my_path/**/*")
+
+Alternatively, Daft supports overriding the default IOConfig per-operation by passing it into the ``io_config=`` keyword argument. This is extremely flexible as you can
+pass a different :class:`daft.io.AzureConfig` per function call if you wish!
+
+.. code:: python
+
+    # Perform some I/O operation but override the IOConfig
+    df2 = daft.read_csv("az://my_container/my_other_path/**/*", io_config=io_config)
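
The URL form described in the new page, ``az://{CONTAINER_NAME}/{OBJECT_KEY}``, can be illustrated with a minimal sketch that splits such a URL into its two parts. The helper ``split_az_url`` is hypothetical and is not part of Daft; it uses only the Python standard library and exists to show that the Storage Account never appears in the URL, which is why it has to be supplied separately.

```python
from urllib.parse import urlparse


def split_az_url(url: str) -> tuple[str, str]:
    """Split an az:// URL into (container_name, object_key).

    Hypothetical helper for illustration only; not a Daft API.
    Assumes the key contains no '?' or '#', which urlparse would
    treat as the start of a query string or fragment.
    """
    parsed = urlparse(url)
    if parsed.scheme != "az":
        raise ValueError(f"not an az:// URL: {url!r}")
    if not parsed.netloc:
        raise ValueError(f"URL is missing a container name: {url!r}")
    # The container is the "host" part; the object key is the path
    # without its leading slash. No storage account is present.
    return parsed.netloc, parsed.path.lstrip("/")


container, key = split_az_url("az://my_container/my_path/data.parquet")
print(container)
print(key)
```

Nothing in the parsed result identifies a Storage Account, so it must come from the environment or from an ``AzureConfig`` as shown in the page.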