diff --git a/README.rst b/README.rst
index 48446c59..d01353ed 100644
--- a/README.rst
+++ b/README.rst
@@ -1,9 +1,8 @@
-=============================================
-smart_open -- utils for streaming large files
-=============================================
-
-|License|_ |Travis|_
+======================================================
+smart_open — utils for streaming large files in Python
+======================================================
+|License|_ |Travis|_

 .. |License| image:: https://img.shields.io/pypi/l/smart_open.svg
 .. |Travis| image:: https://travis-ci.org/RaRe-Technologies/smart_open.svg?branch=master
@@ -13,150 +12,145 @@ smart_open -- utils for streaming large files

 What?
 =====

-``smart_open`` is a Python 2 & Python 3 library for **efficient streaming of very large files** from/to S3, HDFS, WebHDFS, HTTP, or local (compressed) files.
-It is well tested (using `moto `_), well documented and sports a simple, Pythonic API:
+``smart_open`` is a Python 2 & Python 3 library for **efficient streaming of very large files** from/to S3, HDFS, WebHDFS, HTTP, or local (compressed) files. It's a drop-in replacement for Python's built-in ``open()``: it can do anything ``open`` can (100% compatible, falls back to native ``open`` wherever possible), plus lots of nifty extra stuff on top.
+
+``smart_open`` is well-tested, well-documented and sports a simple, Pythonic API:

 .. code-block:: python

-    >>> # stream lines from an S3 object
-    >>> for line in smart_open.smart_open('s3://mybucket/mykey.txt'):
-    ...     print line
+    >>> from smart_open import smart_open

-    >>> # using a completely custom s3 server, like s3proxy:
-    >>> for line in smart_open.smart_open('s3u://user:secret@host:port@mybucket/mykey.txt'):
-    ...     print line
+    >>> # stream lines from an S3 object
+    >>> for line in smart_open('s3://mybucket/mykey.txt', 'rb'):
+    ...     print(line.decode('utf8'))

-    >>> # you can also use a boto.s3.key.Key instance directly:
-    >>> key = boto.connect_s3().get_bucket("my_bucket").get_key("my_key")
-    >>> with smart_open.smart_open(key) as fin:
-    ...     for line in fin:
-    ...         print line
+    >>> # stream from/to compressed files, with transparent (de)compression:
+    >>> for line in smart_open('./foo.txt.gz', encoding='utf8'):
+    ...     print(line)

     >>> # can use context managers too:
-    >>> with smart_open.smart_open('s3://mybucket/mykey.txt') as fin:
+    >>> with smart_open('/home/radim/foo.txt.bz2', 'wb') as fout:
+    ...     fout.write(u"some content\n".encode('utf8'))
+
+    >>> with smart_open('s3://mybucket/mykey.txt', 'rb') as fin:
     ...     for line in fin:
-    ...         print line
+    ...         print(line.decode('utf8'))
     ...     fin.seek(0)  # seek to the beginning
-    ...     print fin.read(1000)  # read 1000 bytes
+    ...     b1000 = fin.read(1000)  # read 1000 bytes

     >>> # stream from HDFS
-    >>> for line in smart_open.smart_open('hdfs://user/hadoop/my_file.txt'):
-    ...     print line
+    >>> for line in smart_open('hdfs://user/hadoop/my_file.txt', encoding='utf8'):
+    ...     print(line)

     >>> # stream from HTTP
-    >>> for line in smart_open.smart_open('http://example.com/index.html'):
-    ...     print line
+    >>> for line in smart_open('http://example.com/index.html'):
+    ...     print(line)

     >>> # stream from WebHDFS
-    >>> for line in smart_open.smart_open('webhdfs://host:port/user/hadoop/my_file.txt'):
-    ...     print line
+    >>> for line in smart_open('webhdfs://host:port/user/hadoop/my_file.txt'):
+    ...     print(line)

     >>> # stream content *into* S3 (write mode):
-    >>> with smart_open.smart_open('s3://mybucket/mykey.txt', 'wb') as fout:
-    ...     for line in ['first line', 'second line', 'third line']:
-    ...         fout.write(line + '\n')
+    >>> with smart_open('s3://mybucket/mykey.txt', 'wb') as fout:
+    ...     for line in [b'first line\n', b'second line\n', b'third line\n']:
+    ...         fout.write(line)

     >>> # stream content *into* HDFS (write mode):
-    >>> with smart_open.smart_open('hdfs://host:port/user/hadoop/my_file.txt', 'wb') as fout:
-    ...     for line in ['first line', 'second line', 'third line']:
-    ...         fout.write(line + '\n')
+    >>> with smart_open('hdfs://host:port/user/hadoop/my_file.txt', 'wb') as fout:
+    ...     for line in [b'first line\n', b'second line\n', b'third line\n']:
+    ...         fout.write(line)

     >>> # stream content *into* WebHDFS (write mode):
-    >>> with smart_open.smart_open('webhdfs://host:port/user/hadoop/my_file.txt', 'wb') as fout:
-    ...     for line in ['first line', 'second line', 'third line']:
-    ...         fout.write(line + '\n')
+    >>> with smart_open('webhdfs://host:port/user/hadoop/my_file.txt', 'wb') as fout:
+    ...     for line in [b'first line\n', b'second line\n', b'third line\n']:
+    ...         fout.write(line)

-    >>> # stream from/to local compressed files:
-    >>> for line in smart_open.smart_open('./foo.txt.gz'):
-    ...     print line
+    >>> # stream using a completely custom s3 server, like s3proxy:
+    >>> for line in smart_open('s3u://user:secret@host:port@mybucket/mykey.txt', 'rb'):
+    ...     print(line.decode('utf8'))

-    >>> with smart_open.smart_open('/home/radim/foo.txt.bz2', 'wb') as fout:
-    ...     fout.write("some content\n")
+    >>> # you can also use a boto.s3.key.Key instance directly:
+    >>> key = boto.connect_s3().get_bucket("my_bucket").get_key("my_key")
+    >>> with smart_open(key, 'rb') as fin:
+    ...     for line in fin:
+    ...         print(line.decode('utf8'))
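+
+    >>> # plain local files work too, just like the built-in open()
+    >>> # (an illustrative sketch -- './document.txt' is only a placeholder path)
+    >>> with smart_open('./document.txt', 'wb') as fout:
+    ...     fout.write(b'hello world!\n')
+    >>> for line in smart_open('./document.txt', 'rb'):
+    ...     print(line.decode('utf8'))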

-Since going over all (or select) keys in an S3 bucket is a very common operation,
-there's also an extra method ``smart_open.s3_iter_bucket()`` that does this efficiently,
-**processing the bucket keys in parallel** (using multiprocessing):

-.. code-block:: python
+Why?
+----

-    >>> # get all JSON files under "mybucket/foo/"
-    >>> bucket = boto.connect_s3().get_bucket('mybucket')
-    >>> for key, content in s3_iter_bucket(bucket, prefix='foo/', accept_key=lambda key: key.endswith('.json')):
-    ...     print key, len(content)
+Working with large S3 files using Amazon's default Python libraries, `boto `_ and `boto3 `_, is a pain. boto's ``key.set_contents_from_string()`` and ``key.get_contents_as_string()`` methods only work for small files (loaded in RAM, no streaming).
+There are nasty hidden gotchas when using ``boto``'s multipart upload functionality, which is needed for large files, and a lot of boilerplate.

-For more info (S3 credentials in URI, minimum S3 part size...) and full method signatures, check out the API docs:
+``smart_open`` shields you from that. It builds on boto3 but offers a cleaner, Pythonic API. The result is less code for you to write and fewer bugs to make.

-.. code-block:: python
+Installation
+------------
+::

-    >>> import smart_open
-    >>> help(smart_open.smart_open_lib)
+    pip install smart_open
+
+Or, if you prefer to install from the `source tar.gz `_::
+
+    python setup.py test  # run unit tests
+    python setup.py install
+
+To run the unit tests (optional), you'll also need to install `mock `_, `moto `_ and `responses `_ (``pip install mock moto responses``). The tests are also run automatically with `Travis CI `_ on every commit push & pull request.

 S3-Specific Options
 -------------------

+The S3 reader supports gzipped content transparently, as long as the key is obviously a gzipped file (e.g. ends with ".gz").
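+
+For example, reading such a key streams the decompressed content line by line (the bucket and key below are just placeholders; any key ending in ``.gz`` works the same way):
+
+.. code-block:: python
+
+    >>> for line in smart_open('s3://mybucket/mykey.txt.gz', 'rb'):
+    ...     print(line.decode('utf8'))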
+
 There are a few optional keyword arguments that are useful only for S3 access.

 The **host** and **profile** arguments are both passed to `boto.s3_connect()` as keyword arguments:

 .. code-block:: python

-    >>> smart_open.smart_open('s3://', host='s3.amazonaws.com')
-    >>> smart_open.smart_open('s3://', profile_name='my-profile')
-
+    >>> smart_open('s3://', host='s3.amazonaws.com')
+    >>> smart_open('s3://', profile_name='my-profile')

 The **s3_session** argument allows you to provide a custom `boto3.Session` instance for connecting to S3:

 .. code-block:: python

-    >>> smart_open.smart_open('s3://', s3_session=boto3.Session())
+    >>> smart_open('s3://', s3_session=boto3.Session())

 The **s3_upload** argument accepts a dict of any parameters accepted by `initiate_multipart_upload `_:

 .. code-block:: python

-    >>> smart_open.smart_open('s3://', s3_upload={ 'ServerSideEncryption': 'AES256' })
-
-
-The S3 reader supports gzipped content, as long as the key is obviously a gzipped file (e.g. ends with ".gz").
-
-Why?
-----
-
-Working with large S3 files using Amazon's default Python library, `boto `_, is a pain. Its ``key.set_contents_from_string()`` and ``key.get_contents_as_string()`` methods only work for small files (loaded in RAM, no streaming).
-There are nasty hidden gotchas when using ``boto``'s multipart upload functionality, and a lot of boilerplate.
-
-``smart_open`` shields you from that. It builds on boto but offers a cleaner API. The result is less code for you to write and fewer bugs to make.
-
-Installation
-------------
-::
-
-    pip install smart_open
+    >>> smart_open('s3://', s3_upload={ 'ServerSideEncryption': 'AES256' })

-Or, if you prefer to install from the `source tar.gz `_::
+Since going over all (or select) keys in an S3 bucket is a very common operation,
+there's also an extra method ``smart_open.s3_iter_bucket()`` that does this efficiently,
+**processing the bucket keys in parallel** (using multiprocessing):

-    python setup.py test  # run unit tests
-    python setup.py install
+.. code-block:: python

-To run the unit tests (optional), you'll also need to install `mock `_ , `moto `_ and `responses ` (``pip install mock moto responses``). The tests are also run automatically with `Travis CI `_ on every commit push & pull request.
+    >>> from smart_open import smart_open, s3_iter_bucket
+    >>> # get all JSON files under "mybucket/foo/"
+    >>> bucket = boto.connect_s3().get_bucket('mybucket')
+    >>> for key, content in s3_iter_bucket(bucket, prefix='foo/', accept_key=lambda key: key.endswith('.json')):
+    ...     print(key, len(content))

-Todo
-----
+For more info (S3 credentials in URI, minimum S3 part size...) and full method signatures, check out the API docs:

-``smart_open`` is an ongoing effort. Suggestions, pull request and improvements welcome!
+.. code-block:: python

-On the roadmap:
+    >>> import smart_open
+    >>> help(smart_open.smart_open_lib)

-* better documentation for the default ``file://`` scheme

 Comments, bug reports
 ---------------------

-``smart_open`` lives on `github `_. You can file
-issues or pull requests there.
+``smart_open`` lives on `GitHub `_. You can file
+issues or pull requests there. Suggestions, pull requests and improvements welcome!

 ----------------

 ``smart_open`` is open source software released under the `MIT license `_.
-Copyright (c) 2015-now `Radim Řehůřek `_.
+Copyright (c) 2015-now `Radim Řehůřek `_.