Medium-scale refactoring #408

Merged
merged 35 commits into from
Mar 29, 2020
Changes from 19 commits
fcd49f6
Medium-scale refactoring
mpenkov Jan 11, 2020
656d2b4
more refactoring
mpenkov Jan 11, 2020
af35d24
automate docstrings
mpenkov Jan 12, 2020
57c459f
link to extending.md from README.rst
mpenkov Jan 12, 2020
3961dbb
fixup
mpenkov Jan 12, 2020
463b060
improve my_urlsplit function name
mpenkov Jan 12, 2020
4f287df
improve docstring
mpenkov Jan 12, 2020
3dcb71a
remove unused variable
mpenkov Jan 12, 2020
4ee4490
fixup
mpenkov Jan 12, 2020
2c8a4f2
disable docstring tweaking on Py2
mpenkov Jan 12, 2020
f489689
more Py27 goodness
mpenkov Jan 12, 2020
b22e3b0
add section to extending.md
mpenkov Jan 12, 2020
b09e03f
Merge remote-tracking branch 'upstream/master' into uri
mpenkov Jan 30, 2020
1cc60ea
improving transport submodule registration
mpenkov Jan 30, 2020
4d3b1a7
integrating gcs into new design
mpenkov Jan 30, 2020
64f43f0
disable moto server by default
mpenkov Jan 30, 2020
6110269
import submodules via importlib for flexibility
mpenkov Mar 27, 2020
9070547
Merge remote-tracking branch 'upstream/master' into uri
mpenkov Mar 27, 2020
abf4fef
move tweak function to doctools
mpenkov Mar 27, 2020
110a557
split out separate transport.py submodule
mpenkov Mar 27, 2020
7d67db8
Merge remote-tracking branch 'upstream/master' into uri
mpenkov Mar 27, 2020
12605ab
fixup
mpenkov Mar 27, 2020
64b2fdd
get rid of Py2
mpenkov Mar 27, 2020
98ded35
get rid of Py2, for real this time
mpenkov Mar 27, 2020
903bfd0
get rid of unused imports
mpenkov Mar 27, 2020
f5dc67f
still more Py2 removal
mpenkov Mar 27, 2020
b309d58
remove unused imports
mpenkov Mar 27, 2020
a936bea
warn on missing docstrings
mpenkov Mar 27, 2020
0720cfc
docstring before and after newline
mpenkov Mar 27, 2020
caf5a42
add doc links to submodules
mpenkov Mar 27, 2020
df7aee7
remove useless comment in setup.py
mpenkov Mar 27, 2020
c1be8de
improve examples
mpenkov Mar 27, 2020
0f6d5e4
split out utils and constants submodules
mpenkov Mar 28, 2020
6d7a73a
split out concurrency submodule
mpenkov Mar 28, 2020
f7a4df0
update extending.md
mpenkov Mar 29, 2020
9 changes: 7 additions & 2 deletions README.rst
@@ -77,6 +77,8 @@ How?
   ...     break
   '<!doctype html>\n'
 
+.. _doctools_after_examples:
+
 Other examples of URLs that ``smart_open`` accepts::
 
     s3://my_bucket/my_key
@@ -96,8 +98,6 @@ Other examples of URLs that ``smart_open`` accepts::
     [ssh|scp|sftp]://username@host/path/file
     [ssh|scp|sftp]://username:password@host/path/file
 
-.. _doctools_after_examples:
-
 
 Documentation
 =============
@@ -407,6 +407,11 @@ This can be helpful when e.g. working with compressed files.
   ...     print(infile.readline()[:41])
   В начале июля, в чрезвычайно жаркое время
 
+Extending ``smart_open``
+========================
+
+See `this document <extending.md>`__.
+
 Comments, bug reports
 =====================
 
130 changes: 130 additions & 0 deletions extending.md
@@ -0,0 +1,130 @@
# Extending `smart_open`

This document targets potential contributors to `smart_open`.

Review comment (Contributor): What about the case where I want to add a new format without pushing it to smart_open (for example, a reader for proprietary data inside a company, which is useless for open source)?

Currently, there are two main directions for extending existing `smart_open` functionality:

1. Add a new transport mechanism
2. Add a new compression format

The first is by far the more challenging, and also the more welcome.

## New transport mechanisms

Each transport mechanism lives in its own submodule.
For example, currently we have:

- `smart_open.file`
- `smart_open.s3`
- `smart_open.ssh`
- ... and others

So, to implement a new transport mechanism, you need to create a new module.
Your module must expose the following:

```python
SCHEMA = ...
"""The name of the mechanism, e.g. s3, ssh, etc.

This is the part that goes before the `://` in a URL, e.g. `s3://`."""

URI_EXAMPLES = ('xxx://foo/bar', 'zzz://baz/boz')
"""This will appear in the documentation of the `parse_uri` function."""


def parse_uri(uri_as_str):
    """Parse the specified URI into a dict.

    At a bare minimum, the dict must have a `schema` member.
    """
    return dict(schema=SCHEMA, ...)


def open_uri(uri_as_str, mode, transport_params):
    """Return a file-like object pointing to the URI.

    Parameters:

    uri_as_str: str
        The URI to open
    mode: str
        Either "rb" or "wb". You don't need to implement text modes,
        `smart_open` does that for you, outside of the transport layer.
    transport_params: dict
        Any additional parameters to pass to the `open` function (see below).

    """
    #
    # Parse the URI using parse_uri
    # Consolidate the parsed URI with transport_params, if needed
    # Pass everything to the open function (see below).
    #
    ...


def open(..., mode, param1=None, param2=None, paramN=None):
    """This function does the hard work.

    The keyword parameters are the transport_params from the `open_uri`
    function.

    """
    ...
```

Have a look at the existing mechanisms to see how they work.
You may define other functions and classes as necessary for your implementation.
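
As a rough illustration of the template above, here is a minimal sketch of a transport module for a made-up `mem://` scheme that keeps objects in a module-level dict. The scheme name, the `_STORAGE` dict, the `_Writer` class, the `bucket`/`key` fields and the `default` parameter are all assumptions made for this example; none of them exist in `smart_open`.

```python
import io
import urllib.parse

SCHEMA = 'mem'  # hypothetical scheme, for illustration only
URI_EXAMPLES = ('mem://bucket/key',)

_STORAGE = {}  # hypothetical backing store: maps (bucket, key) to bytes


def parse_uri(uri_as_str):
    """Split a mem:// URI into its scheme, bucket and key."""
    split = urllib.parse.urlsplit(uri_as_str)
    if split.scheme != SCHEMA:
        raise ValueError('unsupported scheme: %r' % split.scheme)
    return dict(schema=SCHEMA, bucket=split.netloc, key=split.path.lstrip('/'))


class _Writer(io.BytesIO):
    """Buffer writes in memory and store them when the file is closed."""

    def __init__(self, storage_key):
        super().__init__()
        self._storage_key = storage_key

    def close(self):
        # getvalue() fails on a closed buffer, so save the contents first.
        if not self.closed:
            _STORAGE[self._storage_key] = self.getvalue()
        super().close()


def open(bucket, key, mode, default=b''):
    """Open an in-memory object.

    The `default` keyword plays the role of a transport parameter: it is
    what a reader sees when the key has never been written.
    """
    if mode == 'rb':
        return io.BytesIO(_STORAGE.get((bucket, key), default))
    if mode == 'wb':
        return _Writer((bucket, key))
    raise NotImplementedError('unsupported mode: %r' % mode)


def open_uri(uri_as_str, mode, transport_params):
    """Parse the URI, then delegate to open with the transport parameters."""
    parsed = parse_uri(uri_as_str)
    return open(parsed['bucket'], parsed['key'], mode, **transport_params)
```

Writing through `open_uri('mem://b/k', 'wb', {})` and then reading the same URI back in `'rb'` mode should return the bytes that were written. Note that the module-level `open` deliberately shadows the built-in, as the template requires.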

Once your module is working, register it in the `smart_open/smart_open_lib.py` file.
The `_generate_transport()` generator builds a dictionary that maps schemes to the modules that implement functionality for them.
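
To make the idea of a scheme-to-module mapping concrete, here is a small sketch. It is not the actual `_generate_transport()` code: the `register_transport` and `make_module` helpers are hypothetical, the duplicate check is an assumption, and stand-in module objects replace real transport submodules so that the example runs on its own.

```python
import types


def register_transport(registry, module):
    """Map the module's SCHEMA to the module itself, refusing duplicates."""
    scheme = module.SCHEMA
    if scheme in registry:
        raise ValueError('scheme %r is already registered' % scheme)
    registry[scheme] = module


def make_module(name, scheme):
    # Stand-in for a real transport submodule; only SCHEMA is filled in.
    module = types.ModuleType(name)
    module.SCHEMA = scheme
    return module


registry = {}
for name, scheme in [('fake_s3', 's3'), ('fake_ssh', 'ssh')]:
    register_transport(registry, make_module(name, scheme))
```

Opening a URI then becomes a dictionary lookup on the scheme followed by a call to that module's `open_uri`.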

Once you've registered your new transport module, the following will happen automagically:

1. `smart_open` will be able to open any URI supported by your module
2. The docstring for the `smart_open.open` function will contain a section
detailing the parameters for your transport module.
3. The docstring for the `parse_uri` function will include the schemas and
examples supported by your module.

You can confirm the documentation changes by running:

    python -c 'help("smart_open")'

Review comment (Contributor): A link to the submodule should go here, I guess?


### What's the difference between the `open_uri` and `open` functions?

There are several key differences between the two.

First, the parameters to `open_uri` are the same for _all transports_.
On the other hand, the parameters to the `open` function can differ from transport to transport.

Second, the responsibilities of the two functions are also different.
The `open` function opens the remote object.
The `open_uri` function deals with parsing transport-specific details out of the URI, and then delegates to `open`.

The `open` function contains documentation for transport parameters.
This documentation gets parsed by the `doctools` module and appears in various docstrings.
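
The docstring mechanism relies on ordinary `%`-style string formatting, as `tweak_docstrings` in this PR shows. A stripped-down sketch of the same idea follows; the `fill_docstring` helper and the example docstring are hypothetical and only illustrate the substitution step.

```python
def parse_uri(uri_as_str):
    """Parse the given URI.

    Supported schemes:

    %(schemes)s
    """


def fill_docstring(function, substrings):
    # __doc__ is None when the interpreter runs with -OO, so guard against that.
    if function.__doc__:
        function.__doc__ = function.__doc__ % substrings


fill_docstring(parse_uri, {'schemes': '* s3\n    * ssh'})
```

One consequence of this design is that any literal percent sign in a templated docstring has to be written as `%%`.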

Some of these differences are by design; others are a consequence of evolution.

## New compression mechanisms

The compression layer is self-contained in the `smart_open.compression` submodule.

To add support for a new compressor:

- Create a new function to handle your compression format (given an extension)
- Add your compressor to the registry

For example:

```python
def _handle_xz(file_obj, mode):
import lzma
return lzma.LZMAFile(filename=file_obj, mode=mode, format=lzma.FORMAT_XZ)


register_compressor('.xz', _handle_xz)
```
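
A handler like this can be exercised on its own, without going through `smart_open`, which is a convenient way to check it before registering it. The sketch below round-trips a few bytes through the handler using an in-memory buffer; the `raw` buffer and the sample payload are made up for the example.

```python
import io
import lzma


def _handle_xz(file_obj, mode):
    return lzma.LZMAFile(filename=file_obj, mode=mode, format=lzma.FORMAT_XZ)


raw = io.BytesIO()

# Compress into the buffer. Closing the LZMAFile leaves the buffer open.
with _handle_xz(raw, 'wb') as fout:
    fout.write(b'hello world')

# Rewind and decompress from the same buffer.
raw.seek(0)
with _handle_xz(raw, 'rb') as fin:
    data = fin.read()
```

If the handler is correct, `data` holds the original payload and `raw` starts with the XZ magic bytes.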

There are many compression formats out there, and supporting all of them is beyond the scope of `smart_open`.
We want our code's functionality to cover the bare minimum required to satisfy 80% of our users.
We leave the remaining 20% of users with the ability to deal with compression in their own code, using the trivial mechanism described above.
12 changes: 10 additions & 2 deletions smart_open/__init__.py
@@ -16,6 +16,7 @@
 The main functions are:
 
 * `open()`, which opens the given file for reading/writing
+* `parse_uri()`
 * `s3_iter_bucket()`, which goes over all keys in an S3 bucket in parallel
 * `register_compressor()`, which registers callbacks for transparent compressor handling
 
@@ -24,9 +25,16 @@
 import logging
 from smart_open import version
 
-from .smart_open_lib import open, smart_open, register_compressor
+from .smart_open_lib import open, parse_uri, smart_open, register_compressor
 from .s3 import iter_bucket as s3_iter_bucket
-__all__ = ['open', 'smart_open', 's3_iter_bucket', 'register_compressor']
+
+__all__ = [
+    'open',
+    'parse_uri',
+    'register_compressor',
+    's3_iter_bucket',
+    'smart_open',
+]
 
 
 __version__ = version.__version__
126 changes: 126 additions & 0 deletions smart_open/compression.py
@@ -0,0 +1,126 @@
# -*- coding: utf-8 -*-
#
# Copyright (C) 2020 Radim Rehurek <me@radimrehurek.com>
#
# This code is distributed under the terms and conditions
# from the MIT License (MIT).
#
"""Implements the compression layer of the ``smart_open`` library."""
import io
import logging
import os.path
import warnings

import six

logger = logging.getLogger(__name__)


_COMPRESSOR_REGISTRY = {}
_ISSUE_189_URL = 'https://github.com/RaRe-Technologies/smart_open/issues/189'


def get_supported_extensions():
    """Return the list of file extensions for which we have registered compressors."""
    return sorted(_COMPRESSOR_REGISTRY.keys())


def register_compressor(ext, callback):
    """Register a callback for transparently decompressing files with a specific extension.

    Parameters
    ----------
    ext: str
        The extension. Must include the leading period, e.g. ``.gz``.
    callback: callable
        The callback. It must accept two positional arguments, file_obj and mode.
        This function will be called when ``smart_open`` is opening a file with
        the specified extension.

    Examples
    --------

    Instruct smart_open to use the `lzma` module whenever opening a file
    with a .xz extension (see README.rst for the complete example showing I/O):

    >>> def _handle_xz(file_obj, mode):
    ...     import lzma
    ...     return lzma.LZMAFile(filename=file_obj, mode=mode, format=lzma.FORMAT_XZ)
    >>>
    >>> register_compressor('.xz', _handle_xz)

    """
    if not (ext and ext[0] == '.'):
        raise ValueError('ext must be a string starting with ., not %r' % ext)
    if ext in _COMPRESSOR_REGISTRY:
        logger.warning('overriding existing compression handler for %r', ext)
    _COMPRESSOR_REGISTRY[ext] = callback


def _handle_bz2(file_obj, mode):
    if six.PY2:
        from bz2file import BZ2File
    else:
        from bz2 import BZ2File
    return BZ2File(file_obj, mode)


def _handle_gzip(file_obj, mode):
    import gzip
    return gzip.GzipFile(fileobj=file_obj, mode=mode)


def compression_wrapper(file_obj, mode):
    """
    This function will wrap the file_obj with an appropriate
    [de]compression mechanism based on the extension of the filename.

    file_obj must either be a filehandle object, or a class which behaves
    like one. It must have a .name attribute.

    If the filename extension isn't recognized, will simply return the original
    file_obj.
    """

    try:
        _, ext = os.path.splitext(file_obj.name)
    except (AttributeError, TypeError):
        logger.warning(
            'unable to transparently decompress %r because it '
            'seems to lack a string-like .name', file_obj
        )
        return file_obj

    if _need_to_buffer(file_obj, mode, ext):
        warnings.warn('streaming gzip support unavailable, see %s' % _ISSUE_189_URL)
        file_obj = io.BytesIO(file_obj.read())
    if ext in _COMPRESSOR_REGISTRY and mode.endswith('+'):
        raise ValueError('transparent (de)compression unsupported for mode %r' % mode)

    try:
        callback = _COMPRESSOR_REGISTRY[ext]
    except KeyError:
        return file_obj
    else:
        return callback(file_obj, mode)


def _need_to_buffer(file_obj, mode, ext):
    """Returns True if we need to buffer the whole file in memory in order to proceed."""
    try:
        is_seekable = file_obj.seekable()
    except AttributeError:
        #
        # Under Py2, built-in file objects returned by open do not have
        # .seekable, but have a .seek method instead.
        #
        is_seekable = hasattr(file_obj, 'seek')
    is_compressed = ext in _COMPRESSOR_REGISTRY
    return six.PY2 and mode.startswith('r') and is_compressed and not is_seekable


#
# NB. avoid using lambda here to make stack traces more readable.
#
register_compressor('.bz2', _handle_bz2)
register_compressor('.gz', _handle_gzip)
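
The dispatch performed by `compression_wrapper` boils down to a lookup keyed on the file extension. The following standalone sketch reproduces only that core step, leaving out the buffering, the mode check and the error handling; the `wrap` function and the `_REGISTRY` dict are hypothetical stand-ins, not part of this module.

```python
import gzip
import io
import os.path


def _handle_gzip(file_obj, mode):
    return gzip.GzipFile(fileobj=file_obj, mode=mode)


_REGISTRY = {'.gz': _handle_gzip}  # stand-in for _COMPRESSOR_REGISTRY


def wrap(file_obj, mode):
    """Wrap file_obj in a decompressor chosen by the extension of its .name."""
    _, ext = os.path.splitext(file_obj.name)
    callback = _REGISTRY.get(ext)
    if callback is None:
        # Unknown extension: hand back the original object untouched.
        return file_obj
    return callback(file_obj, mode)


compressed = io.BytesIO(gzip.compress(b'hello'))
compressed.name = 'greeting.txt.gz'
decompressed = wrap(compressed, 'rb').read()

plain = io.BytesIO(b'hello')
plain.name = 'greeting.txt'
untouched = wrap(plain, 'rb')
```

Because the lookup uses only the `.name` attribute, any file-like object with a suitable name can take part, including in-memory buffers as shown here.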
61 changes: 61 additions & 0 deletions smart_open/doctools.py
@@ -16,6 +16,12 @@
 import os.path
 import re
 
+import six
+
+from . import compression
+
+_NO_SCHEME = ''
+
 
 def extract_kwargs(docstring):
     """Extract keyword argument documentation from a function's docstring.
@@ -156,3 +162,58 @@ def extract_examples_from_readme_rst(indent=' '):
         return ''.join([indent + re.sub('^ ', '', l) for l in lines])
     except Exception:
         return indent + 'See README.rst'
+
+
+def tweak_docstrings(open_function, parse_uri_function, transport):
+    #
+    # The code below doesn't work on Py2. We _could_ make it work, but given
+    # that it's 2020 and Py2 is on its way out, I'm just going to disable it.
+    #
+    if six.PY2:
+        return
+
+    substrings = {}
+    schemes = io.StringIO()
+    seen_examples = set()
+    uri_examples = io.StringIO()
+
+    for scheme, transport in sorted(transport.items()):
+        if scheme == _NO_SCHEME:
+            continue
+
+        schemes.write(' * %s\n' % scheme)
+
+        try:
+            fn = transport.open
+        except AttributeError:
+            substrings[scheme] = ''
+        else:
+            kwargs = extract_kwargs(fn.__doc__)
+            substrings[scheme] = to_docstring(kwargs, lpad=u' ')
+
+        try:
+            examples = transport.URI_EXAMPLES
+        except AttributeError:
+            continue
+        else:
+            for e in examples:
+                if e not in seen_examples:
+                    uri_examples.write(' * %s\n' % e)
+                    seen_examples.add(e)
+
+    substrings['codecs'] = '\n'.join(
+        [' * %s' % e for e in compression.get_supported_extensions()]
+    )
+    substrings['examples'] = extract_examples_from_readme_rst()
+
+    #
+    # The docstring can be None if -OO was passed to the interpreter.
+    #
+    if open_function.__doc__:
+        open_function.__doc__ = open_function.__doc__ % substrings
+
+    if parse_uri_function.__doc__:
+        parse_uri_function.__doc__ = parse_uri_function.__doc__ % dict(
+            schemes=schemes.getvalue(),
+            uri_examples=uri_examples.getvalue(),
+        )