diff --git a/docs/source/tutorial.rst b/docs/source/tutorial.rst
index 3370856ce..ad9469b26 100644
--- a/docs/source/tutorial.rst
+++ b/docs/source/tutorial.rst
@@ -298,7 +298,7 @@ recommend using the functional form of DataPipes.
 Working with Cloud Storage Providers
 ---------------------------------------------
 
-In this section, we show examples accessing AWS S3 and Google Cloud Storage with built-in``fsspec`` DataPipes.
-Although only those two providers are discussed here, with additional libraries, ``fsspec``
+In this section, we show examples accessing AWS S3, Google Cloud Storage, and Azure Blob Storage with built-in ``fsspec`` DataPipes.
+Although only those three providers are discussed here, with additional libraries, ``fsspec``
 DataPipes should allow you to connect with other storage systems as well (`list of known
 implementations <https://filesystem-spec.readthedocs.io/en/latest/api.html#other-known-implementations>`_).
@@ -384,3 +384,60 @@ directory ``applications``.
     # gcs:/uspto-pair/applications/05900035.zip/05900035/05900035-application_data.tsv, StreamWrapper<...>
     # gcs:/uspto-pair/applications/05900035.zip/05900035/05900035-continuity_data.tsv, StreamWrapper<...>
     # gcs:/uspto-pair/applications/05900035.zip/05900035/05900035-transaction_history.tsv, StreamWrapper<...>
+
+Accessing Azure Blob Storage with ``fsspec`` DataPipes
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+This requires the installation of the libraries ``fsspec``
+(`documentation <https://filesystem-spec.readthedocs.io/en/latest/>`_) and ``adlfs``
+(`adlfs GitHub repo <https://github.com/fsspec/adlfs>`_).
+You can access data in Azure Data Lake Storage Gen2 by providing URIs starting with ``abfs://``.
+For example,
+`FSSpecFileLister <https://pytorch.org/data/main/generated/torchdata.datapipes.iter.FSSpecFileLister.html>`_ (``.list_files_by_fsspec(...)``)
+can be used to list files in a directory in a container:
+
+.. code:: python
+
+    from torchdata.datapipes.iter import IterableWrapper
+
+    storage_options = {'account_name': ACCOUNT_NAME, 'account_key': ACCOUNT_KEY}
+    dp = IterableWrapper(['abfs://CONTAINER/DIRECTORY']).list_files_by_fsspec(**storage_options)
+    print(list(dp))
+    # ['abfs://container/directory/file1.txt', 'abfs://container/directory/file2.txt', ...]
+
+You can also open files using `FSSpecFileOpener <https://pytorch.org/data/main/generated/torchdata.datapipes.iter.FSSpecFileOpener.html>`_
+(``.open_files_by_fsspec(...)``) and stream them
+(if supported by the file format).
+
+Here is an example of loading a CSV file ``ecdc_cases.csv``, located in the directory
+``curated/covid-19/ecdc_cases/latest`` of a public container belonging to the account ``pandemicdatalake``.
+
+.. code:: python
+
+    from torchdata.datapipes.iter import IterableWrapper
+
+    dp = IterableWrapper(['abfs://public/curated/covid-19/ecdc_cases/latest/ecdc_cases.csv']) \
+        .open_files_by_fsspec(account_name='pandemicdatalake') \
+        .parse_csv()
+    print(list(dp)[:3])
+    # [['date_rep', 'day', ..., 'iso_country', 'daterep'],
+    #  ['2020-12-14', '14', ..., 'AF', '2020-12-14'],
+    #  ['2020-12-13', '13', ..., 'AF', '2020-12-13']]
+
+If necessary, you can also access data in Azure Data Lake Storage Gen1 by using URIs starting with
+``adl://``, as described in the `README of the adlfs repo <https://github.com/fsspec/adlfs/blob/main/README.md>`_.
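+
+For instance, as a minimal sketch assuming a hypothetical Gen1 store named ``STORE_NAME`` and
+service-principal credentials (``TENANT_ID``, ``CLIENT_ID``, and ``CLIENT_SECRET`` below are
+placeholders), the same functional DataPipes can be pointed at an ``adl://`` URI, with the keyword
+arguments forwarded to ``adlfs``:
+
+.. code:: python
+
+    from torchdata.datapipes.iter import IterableWrapper
+
+    # Placeholder store name, directory, and service-principal credentials -- replace with your own.
+    storage_options = {'tenant_id': TENANT_ID, 'client_id': CLIENT_ID, 'client_secret': CLIENT_SECRET}
+    dp = IterableWrapper(['adl://STORE_NAME/DIRECTORY']).list_files_by_fsspec(**storage_options)
+    print(list(dp))
+    # e.g. ['adl://store_name/directory/file1.txt', 'adl://store_name/directory/file2.txt', ...]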