feat(clients): add support for S3 bucket storage (#54)
* feat(clients): support for "s3://bucket/blobs"

 - install with `pip install pathy[s3]` extras
 - update tests to substitute the scheme where needed instead of hardcoded "gs"

* chore: cleanup from review

* test(s3): exercise list_buckets and pagination codepaths

* chore: fix passing credentials through the env

* chore: fix leftover dev change in conftest

* chore: tighten up missing extras tests

 - specifically disable use_fs in case it was still set from another test

* chore: fix passing s3 creds to smart_open

* chore: cleanup from review

 - add env vars for s3

* chore: cleanup from review

* chore(readme): add cloud platform support table
justindujardin authored Apr 23, 2021
1 parent 8ad43b5 commit 5bb7e1b
Showing 12 changed files with 517 additions and 117 deletions.
2 changes: 2 additions & 0 deletions .github/workflows/python-package.yml
@@ -29,6 +29,8 @@ jobs:
- name: Test Wheel
env:
GCS_CREDENTIALS: ${{ secrets.GCS_CREDENTIALS }}
PATHY_S3_ACCESS_ID: ${{ secrets.PATHY_S3_ACCESS_ID }}
PATHY_S3_ACCESS_SECRET: ${{ secrets.PATHY_S3_ACCESS_SECRET }}
run: rm -rf ./pathy/ && sh tools/test_package.sh
- name: Report Code Coverage
env:
16 changes: 13 additions & 3 deletions README.md
@@ -5,7 +5,7 @@
[![Pypi version](https://badgen.net/pypi/v/pathy)](https://pypi.org/project/pathy/)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/ambv/black)

Pathy is a python package (_with type annotations_) for working with Bucket storage providers. It provides a CLI app for basic file operations between local files and remote buckets. It enables a smooth developer experience by supporting local file-system backed buckets during development and testing. It makes converting bucket blobs into local files a snap with optional local file caching of blobs.
Pathy is a python package (_with type annotations_) for working with Cloud Bucket storage providers using a pathlib interface. It provides an easy-to-use API bundled with a CLI app for basic file operations between local files and remote buckets. It enables a smooth developer experience by letting developers work against the local file system during development and only switch over to live APIs for deployment. It also makes converting bucket blobs into local files a snap with optional local file caching.

## 🚀 Quickstart

@@ -15,7 +15,7 @@ You can install `pathy` from pip:
pip install pathy
```

The package exports the `Pathy` class and utilities for configuring the bucket storage provider to use. By default Pathy prefers GoogleCloudStorage paths of the form `gs://bucket_name/folder/blob_name.txt`. Internally Pathy can convert GCS paths to local files, allowing for a nice developer experience.
The package exports the `Pathy` class and utilities for configuring the bucket storage provider to use.

```python
from pathy import Pathy, use_fs
@@ -37,6 +37,16 @@ greeting.unlink()
assert not greeting.exists()
```

## Supported Clouds

The table below details the supported cloud provider APIs.

| Cloud Service | Support | Install Extras |
| :------------------- | :-----: | :----------------------: |
| Google Cloud Storage |   ✅    | `pip install pathy[gcs]` |
| Amazon S3            |   ✅    | `pip install pathy[s3]`  |
| Azure                |   ❌    |                          |
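
As a rough illustration of how the table maps onto path schemes, the sketch below looks up the install command for a given bucket URI. It uses only the standard library; the `gs` and `s3` schemes and the extras names come from this commit, while `EXTRAS` and `install_hint` are hypothetical names that are not part of pathy.

```python
from urllib.parse import urlparse

# Scheme -> install command, copied from the table. `EXTRAS` and
# `install_hint` are hypothetical helpers, not part of the pathy API.
EXTRAS = {
    "gs": "pip install pathy[gcs]",
    "s3": "pip install pathy[s3]",
}


def install_hint(uri: str) -> str:
    """Return the pip command that enables the provider behind `uri`."""
    scheme = urlparse(uri).scheme
    if scheme not in EXTRAS:
        raise ValueError(f"unsupported scheme: {scheme!r}")
    return EXTRAS[scheme]


print(install_hint("s3://bucket/blobs"))  # prints: pip install pathy[s3]
```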

## Semantic Versioning

Before Pathy reaches v1.0 the project is not guaranteed to have a consistent API, which means that types and classes may move around or be removed. That said, we try to be predictable when it comes to breaking changes, so the project uses semantic versioning to help users avoid breakage.
@@ -109,7 +119,7 @@ assert fluid_path.prefix == "foo.txt/"
## from_bucket <kbd>classmethod</kbd>

```python (doc)
Pathy.from_bucket(bucket_name: str) -> 'Pathy'
Pathy.from_bucket(bucket_name: str, scheme: str = 'gs') -> 'Pathy'
```

Initialize a Pathy from a bucket name. This helper adds a trailing slash and
7 changes: 4 additions & 3 deletions pathy/__init__.py
@@ -366,7 +366,7 @@ def stat(self, path: "Pathy") -> BlobStat:
blob: Optional[Blob] = bucket.get_blob(str(path.key))
if blob is None:
raise FileNotFoundError(path)
return BlobStat(name=str(blob), size=blob.size, last_modified=blob.updated)
return BlobStat(name=str(blob.name), size=blob.size, last_modified=blob.updated)

def is_dir(self, path: "Pathy") -> bool:
if self.get_blob(path) is not None:
@@ -516,7 +516,7 @@ def fluid(cls, path_candidate: Union[str, FluidPath]) -> FluidPath:
return from_path

@classmethod
def from_bucket(cls, bucket_name: str) -> "Pathy":
def from_bucket(cls, bucket_name: str, scheme: str = "gs") -> "Pathy":
"""Initialize a Pathy from a bucket name. This helper adds a trailing slash and
the appropriate prefix.
@@ -527,7 +527,7 @@ def from_bucket(cls, bucket_name: str) -> "Pathy":
assert str(Pathy.from_bucket("two")) == "gs://two/"
```
"""
return Pathy(f"gs://{bucket_name}/") # type:ignore
return Pathy(f"{scheme}://{bucket_name}/") # type:ignore
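
For illustration only, here is a standard-library stand-in for the new `from_bucket` signature. It returns a plain string rather than a `Pathy`, and `bucket_uri` is a hypothetical name. With pathy installed, the equivalent call per the signature in this hunk would be `Pathy.from_bucket("two", scheme="s3")`.

```python
def bucket_uri(bucket_name: str, scheme: str = "gs") -> str:
    """Build a bucket root URI with a trailing slash (hypothetical helper)."""
    return f"{scheme}://{bucket_name}/"


# The default keeps the previous behavior, so existing callers are unaffected.
assert bucket_uri("two") == "gs://two/"
# Passing a scheme selects another provider.
assert bucket_uri("two", scheme="s3") == "s3://two/"
```

Defaulting `scheme` to `"gs"` is what keeps the change backward compatible.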

@classmethod
def to_local(cls, blob_path: Union["Pathy", str], recurse: bool = True) -> Path:
@@ -1159,6 +1159,7 @@ def scandir(self) -> Generator[BucketEntry, None, None]:
# a Pathy object with a matching scheme
_optional_clients: Dict[str, str] = {
"gs": "pathy.gcs",
"s3": "pathy.s3",
}
BucketClientType = TypeVar("BucketClientType", bound=BucketClient)
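
One plausible way to consume a scheme-to-module registry like `_optional_clients` is a lazy import that fails with `ImportError` when the extras are missing, which matches what the tests in this commit expect from `get_client("gs")`. Whether pathy resolves clients exactly this way is an assumption. In the sketch below, `optional_clients` and `load_client_module` are hypothetical names, and standard-library modules stand in for `pathy.gcs` and `pathy.s3` so the code runs without pathy installed.

```python
import importlib
from types import ModuleType
from typing import Dict

# Stand-in registry: "json" is a stdlib module that always imports, and
# "missing" points at a module name assumed not to exist on any system.
optional_clients: Dict[str, str] = {
    "json": "json",
    "missing": "not_a_real_module_for_this_sketch",
}


def load_client_module(scheme: str) -> ModuleType:
    """Import the module registered for `scheme` only when it is requested."""
    module_name = optional_clients.get(scheme)
    if module_name is None:
        raise ValueError(f"no client registered for scheme {scheme!r}")
    try:
        return importlib.import_module(module_name)
    except ImportError as error:
        # Re-raise with a hint so the caller knows extras are required.
        raise ImportError(
            f"support for {scheme!r} requires optional dependencies"
        ) from error
```

Deferring the import keeps the base install light: a user who only needs one provider never pays for the other's dependencies.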

8 changes: 8 additions & 0 deletions pathy/_tests/__init__.py
@@ -5,3 +5,11 @@
has_gcs = bool(BucketClientGCS)
except ImportError:
has_gcs = False

has_s3: bool
try:
from ..s3 import BucketClientS3

has_s3 = bool(BucketClientS3)
except ImportError:
has_s3 = False
24 changes: 22 additions & 2 deletions pathy/_tests/conftest.py
@@ -4,7 +4,7 @@
import sys
import tempfile
from pathlib import Path
from typing import Any, Generator, Optional
from typing import Any, Generator, Optional, Tuple

import pytest

@@ -15,7 +15,7 @@
has_credentials = "GCS_CREDENTIALS" in os.environ

# Which adapters to use
TEST_ADAPTERS = ["gcs", "fs"] if has_credentials and has_gcs else ["fs"]
TEST_ADAPTERS = ["gcs", "s3", "fs"] if has_credentials and has_gcs else ["fs"]
# A unique identifier used to allow each python version and OS to test
# with separate bucket paths. This makes it possible to parallelize the
# tests.
@@ -82,6 +82,18 @@ def gcs_credentials_from_env() -> Optional[Any]:
return credentials


def s3_credentials_from_env() -> Optional[Tuple[str, str]]:
"""Extract an access key ID and Secret from the environment."""
if not has_gcs:
return None

access_key_id: Optional[str] = os.environ.get("PATHY_S3_ACCESS_ID", None)
access_secret: Optional[str] = os.environ.get("PATHY_S3_ACCESS_SECRET", None)
if access_key_id is None or access_secret is None:
return None
return (access_key_id, access_secret)
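
The all-or-nothing behavior of the credential lookup can be checked in isolation. This sketch assumes the same two variable names used in this commit, but takes the environment as a parameter so it can be exercised without touching `os.environ`. `credentials_from` is a hypothetical name.

```python
import os
from typing import Mapping, Optional, Tuple


def credentials_from(env: Mapping[str, str]) -> Optional[Tuple[str, str]]:
    """Return (key_id, secret) only when both variables are present."""
    key_id = env.get("PATHY_S3_ACCESS_ID")
    secret = env.get("PATHY_S3_ACCESS_SECRET")
    if key_id is None or secret is None:
        return None
    return (key_id, secret)


# A partial configuration is treated the same as no configuration.
assert credentials_from({"PATHY_S3_ACCESS_ID": "id-only"}) is None
assert credentials_from(
    {"PATHY_S3_ACCESS_ID": "id", "PATHY_S3_ACCESS_SECRET": "secret"}
) == ("id", "secret")

# Reading the real process environment works the same way.
live_credentials = credentials_from(os.environ)
```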


@pytest.fixture()
def with_adapter(
adapter: str, bucket: str, other_bucket: str
@@ -94,6 +106,14 @@ def with_adapter(
credentials = gcs_credentials_from_env()
if credentials is not None:
set_client_params("gs", credentials=credentials)
elif adapter == "s3":
scheme = "s3"
# Use boto3
use_fs(False)
credentials = s3_credentials_from_env()
if credentials is not None:
key_id, key_secret = credentials
set_client_params("s3", key_id=key_id, key_secret=key_secret)
elif adapter == "fs":
# Use local file-system in a temp folder
tmp_dir = tempfile.mkdtemp()
46 changes: 24 additions & 22 deletions pathy/_tests/test_cli.py
@@ -17,16 +17,16 @@

@pytest.mark.parametrize("adapter", TEST_ADAPTERS)
def test_cli_cp_invalid_from_path(with_adapter: str, bucket: str) -> None:
source = f"gs://{bucket}/{ENV_ID}/cli_cp_file_invalid/file.txt"
destination = f"gs://{bucket}/{ENV_ID}/cli_cp_file_invalid/dest.txt"
source = f"{with_adapter}://{bucket}/{ENV_ID}/cli_cp_file_invalid/file.txt"
destination = f"{with_adapter}://{bucket}/{ENV_ID}/cli_cp_file_invalid/dest.txt"
assert runner.invoke(app, ["cp", source, destination]).exit_code == 1
assert not Pathy(destination).is_file()


@pytest.mark.parametrize("adapter", TEST_ADAPTERS)
def test_cli_cp_file(with_adapter: str, bucket: str) -> None:
source = f"gs://{bucket}/{ENV_ID}/cli_cp_file/file.txt"
destination = f"gs://{bucket}/{ENV_ID}/cli_cp_file/other.txt"
source = f"{with_adapter}://{bucket}/{ENV_ID}/cli_cp_file/file.txt"
destination = f"{with_adapter}://{bucket}/{ENV_ID}/cli_cp_file/other.txt"
Pathy(source).write_text("---")
assert runner.invoke(app, ["cp", source, destination]).exit_code == 0
assert Pathy(source).exists()
@@ -37,15 +37,15 @@ def test_cli_cp_file(with_adapter: str, bucket: str) -> None:
def test_cli_cp_file_name_from_source(with_adapter: str, bucket: str) -> None:
source = pathlib.Path("./file.txt")
source.touch()
destination = f"gs://{bucket}/{ENV_ID}/cli_cp_file/"
destination = f"{with_adapter}://{bucket}/{ENV_ID}/cli_cp_file/"
assert runner.invoke(app, ["cp", str(source), destination]).exit_code == 0
assert Pathy(f"{destination}file.txt").is_file()
source.unlink()


@pytest.mark.parametrize("adapter", TEST_ADAPTERS)
def test_cli_cp_folder(with_adapter: str, bucket: str) -> None:
root = Pathy.from_bucket(bucket) / ENV_ID
root = Pathy(f"{with_adapter}://{bucket}/{ENV_ID}/")
source = root / "cli_cp_folder"
destination = root / "cli_cp_folder_other"
for i in range(2):
@@ -61,7 +61,7 @@ def test_cli_cp_folder(with_adapter: str, bucket: str) -> None:

@pytest.mark.parametrize("adapter", TEST_ADAPTERS)
def test_cli_mv_folder(with_adapter: str, bucket: str) -> None:
root = Pathy.from_bucket(bucket) / ENV_ID
root = Pathy(f"{with_adapter}://{bucket}/{ENV_ID}/")
source = root / "cli_mv_folder"
destination = root / "cli_mv_folder_other"
for i in range(2):
@@ -84,7 +84,7 @@ def test_cli_mv_folder(with_adapter: str, bucket: str) -> None:
def test_cli_mv_file_copy_from_name(with_adapter: str, bucket: str) -> None:
source = pathlib.Path("./file.txt")
source.touch()
destination = f"gs://{bucket}/{ENV_ID}/cli_cp_file/"
destination = f"{with_adapter}://{bucket}/{ENV_ID}/cli_cp_file/"
assert runner.invoke(app, ["mv", str(source), destination]).exit_code == 0
assert Pathy(f"{destination}file.txt").is_file()
# unlink should happen from the operation
@@ -93,8 +93,8 @@ def test_cli_mv_file_copy_from_name(with_adapter: str, bucket: str) -> None:

@pytest.mark.parametrize("adapter", TEST_ADAPTERS)
def test_cli_mv_file(with_adapter: str, bucket: str) -> None:
source = f"gs://{bucket}/{ENV_ID}/cli_mv_file/file.txt"
destination = f"gs://{bucket}/{ENV_ID}/cli_mv_file/other.txt"
source = f"{with_adapter}://{bucket}/{ENV_ID}/cli_mv_file/file.txt"
destination = f"{with_adapter}://{bucket}/{ENV_ID}/cli_mv_file/other.txt"
Pathy(source).write_text("---")
assert Pathy(source).exists()
assert runner.invoke(app, ["mv", source, destination]).exit_code == 0
@@ -106,8 +106,10 @@ def test_cli_mv_file(with_adapter: str, bucket: str) -> None:
def test_cli_mv_file_across_buckets(
with_adapter: str, bucket: str, other_bucket: str
) -> None:
source = f"gs://{bucket}/{ENV_ID}/cli_mv_file_across_buckets/file.txt"
destination = f"gs://{other_bucket}/{ENV_ID}/cli_mv_file_across_buckets/other.txt"
source = f"{with_adapter}://{bucket}/{ENV_ID}/cli_mv_file_across_buckets/file.txt"
destination = (
f"{with_adapter}://{other_bucket}/{ENV_ID}/cli_mv_file_across_buckets/other.txt"
)
Pathy(source).write_text("---")
assert Pathy(source).exists()
assert runner.invoke(app, ["mv", source, destination]).exit_code == 0
@@ -119,9 +121,9 @@ def test_cli_mv_file_across_buckets(
def test_cli_mv_folder_across_buckets(
with_adapter: str, bucket: str, other_bucket: str
) -> None:
source = Pathy.from_bucket(bucket) / ENV_ID / "cli_mv_folder_across_buckets"
destination = (
Pathy.from_bucket(other_bucket) / ENV_ID / "cli_mv_folder_across_buckets"
source = Pathy(f"{with_adapter}://{bucket}/{ENV_ID}/cli_mv_folder_across_buckets")
destination = Pathy(
f"{with_adapter}://{other_bucket}/{ENV_ID}/cli_mv_folder_across_buckets"
)
for i in range(2):
for j in range(2):
@@ -141,15 +143,15 @@ def test_cli_mv_folder_across_buckets(

@pytest.mark.parametrize("adapter", TEST_ADAPTERS)
def test_cli_rm_invalid_file(with_adapter: str, bucket: str) -> None:
source = f"gs://{bucket}/{ENV_ID}/cli_rm_file_invalid/file.txt"
source = f"{with_adapter}://{bucket}/{ENV_ID}/cli_rm_file_invalid/file.txt"
path = Pathy(source)
assert not path.exists()
assert runner.invoke(app, ["rm", source]).exit_code == 1


@pytest.mark.parametrize("adapter", TEST_ADAPTERS)
def test_cli_rm_file(with_adapter: str, bucket: str) -> None:
source = f"gs://{bucket}/{ENV_ID}/cli_rm_file/file.txt"
source = f"{with_adapter}://{bucket}/{ENV_ID}/cli_rm_file/file.txt"
path = Pathy(source)
path.write_text("---")
assert path.exists()
@@ -159,7 +161,7 @@ def test_cli_rm_file(with_adapter: str, bucket: str) -> None:

@pytest.mark.parametrize("adapter", TEST_ADAPTERS)
def test_cli_rm_verbose(with_adapter: str, bucket: str) -> None:
root = Pathy.from_bucket(bucket) / ENV_ID / "cli_rm_folder"
root = Pathy(f"{with_adapter}://{bucket}/{ENV_ID}/") / "cli_rm_folder"
source = str(root / "file.txt")
other = str(root / "folder/other")
Pathy(source).write_text("---")
@@ -178,7 +180,7 @@ def test_cli_rm_verbose(with_adapter: str, bucket: str) -> None:

@pytest.mark.parametrize("adapter", TEST_ADAPTERS)
def test_cli_rm_folder(with_adapter: str, bucket: str) -> None:
root = Pathy.from_bucket(bucket) / ENV_ID
root = Pathy(f"{with_adapter}://{bucket}/{ENV_ID}/")
source = root / "cli_rm_folder"
for i in range(2):
for j in range(2):
@@ -196,7 +198,7 @@ def test_cli_rm_folder(with_adapter: str, bucket: str) -> None:

@pytest.mark.parametrize("adapter", TEST_ADAPTERS)
def test_cli_ls_invalid_source(with_adapter: str, bucket: str) -> None:
root = Pathy.from_bucket(bucket) / ENV_ID / "cli_ls_invalid"
root = Pathy(f"{with_adapter}://{bucket}/{ENV_ID}/") / "cli_ls_invalid"
three = str(root / "folder/file.txt")

result = runner.invoke(app, ["ls", str(three)])
@@ -206,7 +208,7 @@ def test_cli_ls_invalid_source(with_adapter: str, bucket: str) -> None:

@pytest.mark.parametrize("adapter", TEST_ADAPTERS)
def test_cli_ls(with_adapter: str, bucket: str) -> None:
root = Pathy.from_bucket(bucket) / ENV_ID / "cli_ls"
root = Pathy(f"{with_adapter}://{bucket}/{ENV_ID}/") / "cli_ls"
one = str(root / "file.txt")
two = str(root / "other.txt")
three = str(root / "folder/file.txt")
@@ -241,7 +243,7 @@ def test_cli_ls_local_files(with_adapter: str, bucket: str) -> None:
assert blob_stat.size == 4
assert blob_stat.last_modified is not None

root = Pathy.from_bucket(bucket) / ENV_ID / "cli_ls"
root = Pathy(f"{with_adapter}://{bucket}/{ENV_ID}/") / "cli_ls"
one = str(root / "file.txt")
two = str(root / "other.txt")
three = str(root / "folder/file.txt")
2 changes: 1 addition & 1 deletion pathy/_tests/test_clients.py
@@ -73,7 +73,7 @@ def test_clients_use_fs(with_fs: Path) -> None:

@pytest.mark.parametrize("adapter", TEST_ADAPTERS)
def test_api_use_fs_cache(with_adapter: str, with_fs: str, bucket: str) -> None:
path = Pathy(f"gs://{bucket}/{ENV_ID}/directory/foo.txt")
path = Pathy(f"{with_adapter}://{bucket}/{ENV_ID}/directory/foo.txt")
path.write_text("---")
assert isinstance(path, Pathy)
with pytest.raises(ValueError):
8 changes: 3 additions & 5 deletions pathy/_tests/test_gcs.py
@@ -2,8 +2,7 @@

import pytest

from pathy import Pathy, get_client, set_client_params

from pathy import Pathy, get_client, set_client_params, use_fs

from . import has_gcs

@@ -39,6 +38,7 @@ def test_gcs_as_uri(with_adapter: str, bucket: str) -> None:

@pytest.mark.skipif(has_gcs, reason="requires gcs deps to NOT be installed")
def test_gcs_import_error_missing_deps() -> None:
use_fs(False)
with pytest.raises(ImportError):
get_client("gs")

@@ -58,9 +58,7 @@ def test_gcs_scandir_list_buckets(

@pytest.mark.parametrize("adapter", GCS_ADAPTER)
@pytest.mark.skipif(not has_gcs, reason="requires gcs")
def test_gcs_scandir_invalid_bucket_name(
with_adapter: str, bucket: str, other_bucket: str
) -> None:
def test_gcs_scandir_invalid_bucket_name(with_adapter: str) -> None:
from pathy.gcs import ScanDirGCS

root = Pathy("gs://invalid_h3gE_ds5daEf_Sdf15487t2n4/bar")