Refactor S3, replace high-level resource/session API with low-level client API #583

Merged 20 commits on Mar 1, 2021
106 changes: 105 additions & 1 deletion MIGRATING_FROM_OLDER_VERSIONS.rst
@@ -1,3 +1,108 @@
Migrating to the new client-based S3 API
========================================

Versions of smart_open prior to 5.0.0 used the boto3 `resource API`_ for communicating with S3.
This API was easy for smart_open developers to integrate, but it came at a cost: it was not thread- or multiprocess-safe.
Furthermore, as smart_open supported more and more options, the transport parameter list grew, making it harder to maintain.

Starting with version 5.0.0, smart_open uses the `client API`_ instead of the resource API.
Functionally, very little changes for the smart_open user.
The only difference is in passing transport parameters to the S3 backend.

More specifically, the following S3 transport parameters are no longer supported:

- `multipart_upload_kwargs`
- `object_kwargs`
- `resource`
- `resource_kwargs`
- `session`
- `singlepart_upload_kwargs`

**If you weren't using the above parameters, nothing changes for you.**
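
One way to check whether existing code is affected is to scan the ``transport_params`` dicts it passes to ``smart_open.open`` for the names listed above. The following is only a minimal sketch: ``find_removed_params`` is a hypothetical helper written for this guide, not a function that smart_open provides.

```python
# Hypothetical helper for illustration; smart_open does not ship this function.
REMOVED_S3_PARAMS = frozenset({
    'multipart_upload_kwargs',
    'object_kwargs',
    'resource',
    'resource_kwargs',
    'session',
    'singlepart_upload_kwargs',
})


def find_removed_params(transport_params):
    """Return the sorted names in transport_params that 5.0.0 no longer supports."""
    return sorted(REMOVED_S3_PARAMS.intersection(transport_params))


old_style = {'session': object(), 'buffer_size': 1024}
new_style = {'client': object(), 'buffer_size': 1024}
print(find_removed_params(old_style))  # ['session']
print(find_removed_params(new_style))  # []
```

An empty list means the dict needs no changes for the S3 backend.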

However, if you were using any of the above, then you need to adjust your code.
Here are some quick recipes.

If you were previously passing `session`, then construct an S3 client from the session and pass that instead.
For example, before:

.. code-block:: python

    smart_open.open('s3://bucket/key', transport_params={'session': session})

After:

.. code-block:: python

    smart_open.open('s3://bucket/key', transport_params={'client': session.client('s3')})

If you were passing `resource`, then replace the resource with a client, and pass that instead.
For example, before:

.. code-block:: python

    resource = session.resource('s3', **resource_kwargs)
    smart_open.open('s3://bucket/key', transport_params={'resource': resource})

After:

.. code-block:: python

    client = session.client('s3')
    smart_open.open('s3://bucket/key', transport_params={'client': client})

If you were passing any of the `*_kwargs` parameters, you will need to include them in `client_kwargs`, keeping in mind the following transformations.

========================== ====================================== ====================================
Parameter name             Resource API method                    Client API function
========================== ====================================== ====================================
`multipart_upload_kwargs`  `s3.Object.initiate_multipart_upload`_ `s3.Client.create_multipart_upload`_
`object_kwargs`            `s3.Object.get`_                       `s3.Client.get_object`_
`resource_kwargs`          s3.resource                            `s3.client`_
`singlepart_upload_kwargs` `s3.Object.put`_                       `s3.Client.put_object`_
========================== ====================================== ====================================

Most of the above is self-explanatory, with the exception of `resource_kwargs`.
These were previously used mostly for passing a custom endpoint URL.

The `client_kwargs` dict can thus contain the following keys, each mapping to a dict of keyword arguments:

- `S3.Client`: initializer parameters, i.e. those to pass directly to the `boto3.client` function, such as `endpoint_url`.
- `S3.Client.create_multipart_upload`
- `S3.Client.get_object`
- `S3.Client.put_object`
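
The mapping from the old ``*_kwargs`` parameters to ``client_kwargs`` entries can be sketched as a small conversion function. This is an illustration under stated assumptions, not part of smart_open: ``migrate_kwargs`` is a hypothetical helper, and the key names follow the spelling used in smart_open's help text (e.g. ``S3.Client.create_multipart_upload``).

```python
# Hypothetical helper for illustration; smart_open does not ship this function.
# Keys follow the spelling in smart_open's help text, e.g.
# 'S3.Client.create_multipart_upload'.
OLD_TO_NEW_KEY = {
    'multipart_upload_kwargs': 'S3.Client.create_multipart_upload',
    'object_kwargs': 'S3.Client.get_object',
    'resource_kwargs': 'S3.Client',
    'singlepart_upload_kwargs': 'S3.Client.put_object',
}


def migrate_kwargs(old_params):
    """Return a copy of old_params with the *_kwargs entries moved into client_kwargs."""
    new_params = {}
    client_kwargs = {}
    for name, value in old_params.items():
        if name in OLD_TO_NEW_KEY:
            # Re-key the old dict under the client method it now applies to.
            client_kwargs[OLD_TO_NEW_KEY[name]] = dict(value)
        else:
            # Everything else (e.g. min_part_size) is passed through unchanged.
            new_params[name] = value
    if client_kwargs:
        new_params['client_kwargs'] = client_kwargs
    return new_params


old = {
    'resource_kwargs': {'endpoint_url': 'https://ams3.digitaloceanspaces.com'},
    'multipart_upload_kwargs': {'ACL': 'private'},
    'min_part_size': 50 * 1024 ** 2,
}
print(migrate_kwargs(old))
```

Note that this sketch passes `session` and `resource` through untouched; those still have to be replaced with a client by hand, as in the recipes above.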

Here's a before-and-after example for connecting to a custom endpoint. Before:

.. code-block:: python

    import boto3
    from smart_open import open

    session = boto3.Session(profile_name='digitalocean')
    resource_kwargs = {'endpoint_url': 'https://ams3.digitaloceanspaces.com'}
    transport_params = {'session': session, 'resource_kwargs': resource_kwargs}
    with open('s3://bucket/key.txt', 'wb', transport_params=transport_params) as fout:
        fout.write(b'here we stand')

After:

.. code-block:: python

    import boto3
    from smart_open import open

    session = boto3.Session(profile_name='digitalocean')
    client = session.client('s3', endpoint_url='https://ams3.digitaloceanspaces.com')
    with open('s3://bucket/key.txt', 'wb', transport_params={'client': client}) as fout:
        fout.write(b'here we stand')

See `README <README.rst>`_ and `HOWTO <howto.md>`_ for more examples.

.. _resource API: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#service-resource
.. _s3.Object.initiate_multipart_upload: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Object.initiate_multipart_upload
.. _s3.Object.get: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Object.get
.. _s3.Object.put: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Object.put

.. _client API: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#client
.. _s3.Client: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#client
.. _s3.Client.create_multipart_upload: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.create_multipart_upload
.. _s3.Client.get_object: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.get_object
.. _s3.Client.put_object: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.put_object

Migrating to the new dependency management subsystem
====================================================

@@ -111,4 +216,3 @@ or view the help online `here <https://github.com/RaRe-Technologies/smart_open/b

If you pass an invalid parameter name, the ``smart_open.open`` function will warn you about it.
Keep an eye on your logs for WARNING messages from ``smart_open``.
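
To make sure such warnings are actually visible, the standard ``logging`` module can be configured to emit them. The snippet below is a minimal sketch; it assumes that smart_open's loggers are named after its modules and therefore sit under the ``smart_open`` logger, which is not stated in this guide.

```python
import logging

# Assumption: smart_open's loggers live under the 'smart_open' name, so
# configuring that logger also covers loggers of its submodules.
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter('%(levelname)s %(name)s: %(message)s'))

logger = logging.getLogger('smart_open')
logger.addHandler(handler)
logger.setLevel(logging.WARNING)  # show WARNING and above, hide INFO and DEBUG
```

With this in place, WARNING messages are written to standard error, which is where ``StreamHandler`` writes by default.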

26 changes: 12 additions & 14 deletions README.rst
@@ -154,7 +154,7 @@ For the sake of simplicity, the examples below assume you have all the dependenc
... aws_secret_access_key=os.environ['AWS_SECRET_ACCESS_KEY'],
... )
>>> url = 's3://smart-open-py37-benchmark-results/test.txt'
>>> with open(url, 'wb', transport_params={'session': session}) as fout:
>>> with open(url, 'wb', transport_params={'client': session.client('s3')}) as fout:
... bytes_written = fout.write(b'hello world!')
... print(bytes_written)
12
@@ -182,12 +182,9 @@ For the sake of simplicity, the examples below assume you have all the dependenc
print(line)

# Stream to Digital Ocean Spaces bucket providing credentials from boto3 profile
transport_params = {
'session': boto3.Session(profile_name='digitalocean'),
'resource_kwargs': {
'endpoint_url': 'https://ams3.digitaloceanspaces.com',
}
}
session = boto3.Session(profile_name='digitalocean')
client = session.client('s3', endpoint_url='https://ams3.digitaloceanspaces.com')
transport_params = {'client': client}
with open('s3://bucket/key.txt', 'wb', transport_params=transport_params) as fout:
fout.write(b'here we stand')

@@ -202,15 +199,15 @@ For the sake of simplicity, the examples below assume you have all the dependenc
# stream from Azure Blob Storage
connect_str = os.environ['AZURE_STORAGE_CONNECTION_STRING']
transport_params = {
client: azure.storage.blob.BlobServiceClient.from_connection_string(connect_str)
'client': azure.storage.blob.BlobServiceClient.from_connection_string(connect_str),
}
for line in open('azure://mycontainer/myfile.txt', transport_params=transport_params):
print(line)

# stream content *into* Azure Blob Storage (write mode):
connect_str = os.environ['AZURE_STORAGE_CONNECTION_STRING']
transport_params = {
client: azure.storage.blob.BlobServiceClient.from_connection_string(connect_str)
'client': azure.storage.blob.BlobServiceClient.from_connection_string(connect_str),
}
with open('azure://mycontainer/my_file.txt', 'wb', transport_params=transport_params) as fout:
fout.write(b'hello world')
@@ -264,7 +261,7 @@ Here are some examples of using this parameter:
.. code-block:: python

>>> import boto3
>>> fin = open('s3://commoncrawl/robots.txt', transport_params=dict(session=boto3.Session()))
>>> fin = open('s3://commoncrawl/robots.txt', transport_params=dict(client=boto3.client('s3')))
>>> fin = open('s3://commoncrawl/robots.txt', transport_params=dict(buffer_size=1024))

For the full list of keyword arguments supported by each transport option, see the documentation:
@@ -281,8 +278,8 @@ S3 Credentials
By default, ``smart_open`` will defer to ``boto3`` and let the latter take care of the credentials.
There are several ways to override this behavior.

The first is to pass a ``boto3.Session`` object as a transport parameter to the ``open`` function.
You can customize the credentials when constructing the session.
The first is to pass a ``boto3.Client`` object as a transport parameter to the ``open`` function.
You can customize the credentials when constructing the session for the client.
``smart_open`` will then use the client when talking to S3.

.. code-block:: python
@@ -292,15 +289,16 @@ You can customize the credentials when constructing the session.
aws_secret_access_key=SECRET_KEY,
aws_session_token=SESSION_TOKEN,
)
fin = open('s3://bucket/key', transport_params=dict(session=session), ...)
client = session.client('s3', endpoint_url=..., config=...)
fin = open('s3://bucket/key', transport_params=dict(client=client))

Your second option is to specify the credentials within the S3 URL itself:

.. code-block:: python

fin = open('s3://aws_access_key_id:aws_secret_access_key@bucket/key', ...)

*Important*: The two methods above are **mutually exclusive**. If you pass an AWS session *and* the URL contains credentials, ``smart_open`` will ignore the latter.
*Important*: The two methods above are **mutually exclusive**. If you pass an AWS client *and* the URL contains credentials, ``smart_open`` will ignore the latter.

*Important*: ``smart_open`` ignores configuration files from the older ``boto`` library.
Port your old ``boto`` settings to ``boto3`` in order to use them with ``smart_open``.
25 changes: 9 additions & 16 deletions help.txt
@@ -137,17 +137,6 @@ FUNCTIONS
The buffer size to use when performing I/O.
min_part_size: int, optional
The minimum part size for multipart uploads. For writing only.
session: object, optional
The S3 session to use when working with boto3.
resource_kwargs: dict, optional
Keyword arguments to use when accessing the S3 resource for reading or writing.
multipart_upload_kwargs: dict, optional
Additional parameters to pass to boto3's initiate_multipart_upload function.
For writing only.
singlepart_upload_kwargs: dict, optional
Additional parameters to pass to boto3's S3.Object.put function when using single
part upload.
For writing only.
multipart_upload: bool, optional
Default: `True`
If set to `True`, will use multipart upload for writing to S3. If set
@@ -157,14 +146,18 @@ FUNCTIONS
version_id: str, optional
Version of the object, used when reading object.
If None, will fetch the most recent version.
object_kwargs: dict, optional
Additional parameters to pass to boto3's object.get function.
Used during reading only.
defer_seek: boolean, optional
Default: `False`
If set to `True` on a file opened for reading, GetObject will not be
called until the first seek() or read().
Avoids redundant API queries when seeking before reading.
client: object, optional
The S3 client to use when working with boto3.
If you don't specify this, then smart_open will create a new client for you.
client_kwargs: dict, optional
Additional parameters to pass to the relevant functions of the client.
The keys are fully qualified method names, e.g. `S3.Client.create_multipart_upload`.
The values are kwargs to pass to that method each time it is called.
writebuffer: IO[bytes], optional
By default, this module will buffer data in memory using io.BytesIO
when writing. Pass another binary IO instance here to use it instead.
@@ -325,13 +318,13 @@ FUNCTIONS
s3_iter_bucket(bucket_name, prefix='', accept_key=None, key_limit=None, workers=16, retries=3, **session_kwargs)
Deprecated. Use smart_open.s3.iter_bucket instead.

smart_open(uri, mode='rb', **kw)
smart_open(uri, mode='rb', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None, ignore_extension=False, **kwargs)

DATA
__all__ = ['open', 'parse_uri', 'register_compressor', 's3_iter_bucket...

VERSION
2.2.1
4.1.2.dev0

FILE
/Users/misha/git/smart_open/smart_open/__init__.py