Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wrapper around smart_open to cache transport_params for every call #335

Merged

Conversation

oplatek
Copy link
Contributor

@oplatek oplatek commented Jul 8, 2021

Motivation

The smart_open.open function when used with default value setups all necessary parameters to access e.g. S3 cloud uri.
However, it typically means that it creates a session e.g. boto3.session with client('s3') for AWS S3 access.
Creating such a session for every call is slow.
If we reused a single session for AWS S3 access for 1000 calls the Wall time was reduced from 44s to 18.9s.

This PR introduces SmartOpen singleton class
which caches transport_params and thus e.g. the client('s3')
In addition, it imports smart_open.open only once and does not import it in every call.

oplatek added 3 commits July 8, 2021 11:51
There is no nice way how to pass transport_params
to the reader and writer
The user has to call SmartOpen.setup separately
@oplatek oplatek force-pushed the transport-params-smart-open-everywhere branch from b623b3b to c75832d Compare July 8, 2021 12:15
Copy link
Collaborator

@pzelasko pzelasko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks! Let’s extend the doc a bit and LGTM

lhotse/utils.py Outdated
client = session.client('s3')
from lhotse.utils import SmartOpen

if not slow: # switch between 44s vs 18.9 Wall time
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry to be nit-picking -- this 44s vs 18.9s seems very specific to your use-case, could we remove it?

Copy link
Collaborator

@pzelasko pzelasko Jul 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also -- can you format it like:

Example::

    >>> import boto3
    >>> session = boto3.Session()
    ...

This will render well with sphinx.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you pls check that it is ok now?

Or just tell me how to generate (and check) the docs myself.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The auto generated docs for this PR are here https://lhotse--335.org.readthedocs.build/en/335/

RTD used to add it as a CI test result in GH but the integration broke when I moved Lhotse to lhotse-speech organization… and I couldn’t fix it.

Unfortunately it seems that the API reference doesn’t list lhotse.utils so you won’t see it there anyway.. I haven’t gotten to re-working this part of the docs yet.

lhotse/utils.py Outdated Show resolved Hide resolved
@pzelasko
Copy link
Collaborator

pzelasko commented Jul 9, 2021

Looking good, thanks!

@pzelasko pzelasko merged commit 1f7ad74 into lhotse-speech:master Jul 9, 2021
@pzelasko pzelasko added this to the v0.8 milestone Jul 13, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants