Add docs on accessing Azure blob storage through fsspec (#836)

Summary:
### Changes

Adding an example of DataPipe usage with Azure Blob storage via `fsspec`, similar to #812. The example is placed in a new section of `docs/source/tutorial.rst`.

Here is a screenshot showing that the code snippets in the tutorial work as expected:

<img width="1569" alt="Screenshot 2022-10-18 at 19 33 49" src="https://user-images.githubusercontent.com/23200558/196503562-034162c0-6dde-4749-adc7-5e081ff2c19f.png">

#### Minor note

Technically, `fsspec` [allows both path prefixes `abfs://` and `az://`](https://github.com/fsspec/adlfs/blob/f15c37a43afd87a04f01b61cd90294dd57181e1d/README.md?plain=1#L33) as synonyms for Azure Blob Storage Gen2. However, only `abfs://` works for us, for the following reason:
- If a path starts with `az`, the variable `fs.protocol` [here](https://github.com/pytorch/data/blob/768ecdae8b56af640a78e29f82864dc4f65df371/torchdata/datapipes/iter/load/fsspec.py#L82) is still `abfs`
- So the condition `root.startswith(protocol)` is false, and `is_local` is true
- As a result, the path "doubles" in [this line](https://github.com/pytorch/data/blob/768ecdae8b56af640a78e29f82864dc4f65df371/torchdata/datapipes/iter/load/fsspec.py#L95), as shown in this screenshot:
<img width="754" alt="Screenshot 2022-10-18 at 19 50 56" src="https://user-images.githubusercontent.com/23200558/196506965-697eb2d7-8f84-4536-972b-7081e55e1ff5.png">

This has no effect on users, however, as long as they use the `abfs://` prefix recommended in the tutorial.
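
To make the failure mode concrete, here is a minimal, self-contained sketch of the logic (illustrative names only, not the actual `torchdata` source):

```python
# Illustrative sketch -- not the actual torchdata source.
# fsspec maps both "az://" and "abfs://" to the same filesystem class,
# whose canonical protocol is "abfs".
root = "az://container/directory"
protocol = "abfs"  # what fs.protocol reports, even for an "az://" URI

is_local = not root.startswith(protocol)  # True here, although the path is remote
file_name = "container/directory/file1.txt"

# The "local" branch prepends the root again, which doubles the path:
full_path = root + "/" + file_name if is_local else file_name
print(full_path)
# az://container/directory/container/directory/file1.txt
```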

Pull Request resolved: #836

Reviewed By: NivekT

Differential Revision: D40483505

Pulled By: sgrigory

fbshipit-source-id: f03373aa4b376af8ea2ac3480fc133067caaa0ce
sgrigory authored and ejguan committed Oct 21, 2022
1 parent 6386a0b commit 76d9b84
Showing 1 changed file with 41 additions and 1 deletion.
docs/source/tutorial.rst
@@ -298,7 +298,7 @@ recommend using the functional form of DataPipes.
Working with Cloud Storage Providers
---------------------------------------------

In this section, we show examples accessing AWS S3 and Google Cloud Storage with built-in ``fsspec`` DataPipes.
In this section, we show examples accessing AWS S3, Google Cloud Storage, and Azure Blob Storage with built-in ``fsspec`` DataPipes.
Although only those three providers are discussed here, with additional libraries, ``fsspec`` DataPipes
should allow you to connect with other storage systems as well (`list of known
implementations <https://filesystem-spec.readthedocs.io/en/latest/api.html#other-known-implementations>`_).
@@ -384,3 +384,43 @@ directory ``applications``.
# gcs:/uspto-pair/applications/05900035.zip/05900035/05900035-application_data.tsv, StreamWrapper<...>
# gcs:/uspto-pair/applications/05900035.zip/05900035/05900035-continuity_data.tsv, StreamWrapper<...>
# gcs:/uspto-pair/applications/05900035.zip/05900035/05900035-transaction_history.tsv, StreamWrapper<...>

Accessing Azure Blob storage with ``fsspec`` DataPipes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This requires the installation of the libraries ``fsspec``
(`documentation <https://filesystem-spec.readthedocs.io/en/latest/>`_) and ``adlfs``
(`adlfs GitHub repo <https://github.com/fsspec/adlfs>`_).
You can access data in Azure Data Lake Storage Gen2 by providing URIs starting with ``abfs://``.
For example,
`FSSpecFileLister <generated/torchdata.datapipes.iter.FSSpecFileLister.html>`_ (``.list_files_by_fsspec(...)``)
can be used to list files in a directory in a container:

.. code:: python

    from torchdata.datapipes.iter import IterableWrapper

    storage_options = {'account_name': ACCOUNT_NAME, 'account_key': ACCOUNT_KEY}
    dp = IterableWrapper(['abfs://CONTAINER/DIRECTORY']).list_files_by_fsspec(**storage_options)
    print(list(dp))
    # ['abfs://container/directory/file1.txt', 'abfs://container/directory/file2.txt', ...]

You can also open files using `FSSpecFileOpener <generated/torchdata.datapipes.iter.FSSpecFileOpener.html>`_
(``.open_files_by_fsspec(...)``) and stream them
(if supported by the file format).

Here is an example of loading the CSV file ``ecdc_cases.csv`` from the directory
``curated/covid-19/ecdc_cases/latest`` in a public container belonging to account ``pandemicdatalake``.

.. code:: python

    from torchdata.datapipes.iter import IterableWrapper

    dp = IterableWrapper(['abfs://public/curated/covid-19/ecdc_cases/latest/ecdc_cases.csv']) \
        .open_files_by_fsspec(account_name='pandemicdatalake') \
        .parse_csv()
    print(list(dp)[:3])
    # [['date_rep', 'day', ..., 'iso_country', 'daterep'],
    #  ['2020-12-14', '14', ..., 'AF', '2020-12-14'],
    #  ['2020-12-13', '13', ..., 'AF', '2020-12-13']]

If necessary, you can also access data in Azure Data Lake Storage Gen1 by using URIs starting with
``adl://`` and ``abfs://``, as described in the `README of the adlfs repo <https://github.com/fsspec/adlfs/blob/main/README.md>`_.
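
For illustration, here is a minimal sketch of listing files in a Gen1 store. The store name, directory, and
service-principal credentials below are placeholders; the keyword arguments follow the ``adlfs`` README.

.. code:: python

    from torchdata.datapipes.iter import IterableWrapper

    # STORE_NAME, DIRECTORY, and the credential variables are placeholders
    dp = IterableWrapper(['adl://STORE_NAME/DIRECTORY']).list_files_by_fsspec(
        tenant_id=TENANT_ID, client_id=CLIENT_ID, client_secret=CLIENT_SECRET
    )
    print(list(dp))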
