Commit

Doc/45 add more code examples to documentation (#54)
noexec authored Apr 23, 2023
2 parents f60a5be + 83dc148 commit f70cd52
Showing 3 changed files with 112 additions and 74 deletions.
184 changes: 110 additions & 74 deletions README.md
@@ -2,6 +2,7 @@
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/curldl)](https://pypi.org/project/curldl/)
[![GitHub Workflow Status](https://github.com/noexec/curldl/actions/workflows/ci.yml/badge.svg)](https://github.com/noexec/curldl/actions/workflows/ci.yml)
[![codecov](https://codecov.io/gh/noexec/curldl/branch/develop/graph/badge.svg?token=QOA9KZ9A44)](https://codecov.io/gh/noexec/curldl)
[![security: bandit](https://img.shields.io/badge/security-bandit-yellow.svg)](https://github.com/PyCQA/bandit)
[![Read the Docs](https://img.shields.io/readthedocs/curldl)](https://curldl.readthedocs.io/)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![Imports: isort](https://img.shields.io/badge/imports-isort-1674b1.svg?labelColor=ef8336)](https://pycqa.github.io/isort/)
@@ -20,23 +21,6 @@ The __curldl__ Python module safely and reliably downloads files with [PycURL](h
* Speed: since native _libcurl_ writes directly to the output stream file descriptor, there are no transfers of large chunks of data inside Python interpreter.


# Installation

The only requirement for _curldl_ is Python 3.8+. Install the package as follows:
```shell
pip install curldl
```

If you encounter a build failure during installation of _pycurl_ dependency, the following should help:
* On Linux, install one of:
* _pycurl_ package from distribution repo — e.g., on Ubuntu run `sudo apt install python3-pycurl`
* _libcurl_ development files with `sudo apt install build-essential libcurl4-openssl-dev`
* On Windows, install an unofficial _pycurl_ build since official builds are not available at the moment — e.g., by [Christoph Gohlke](https://www.lfd.uci.edu/~gohlke/pythonlibs/#pycurl), or use _Conda_ (see below).
* On Windows and macOS, use _Conda_ or _Miniconda_ with [conda-forge](https://conda-forge.org/) channel. For instance, see runtime dependencies in the following [test environment](https://github.com/noexec/curldl/blob/develop/misc/conda/test-environment.yml).

Overall, _curldl_ is expected to have no issues in any environment with Python 3.8+ (CPython or PyPy) — see [Testing](#testing) section below.


# Usage

Most examples below use the _curldl_ wrapper script instead of Python code. Of course, in all cases it is easy to write a few lines of code with identical functionality — see the first example. Also, note that inline documentation is available for all functions.
@@ -48,10 +32,11 @@ The following code snippet downloads a file and verifies its size and SHA-1 dige

```python
import curldl, os
dl = curldl.Curldl(basedir='downloads', progress=True)
dl.get('https://kernel.org/pub/linux/kernel/Historic/linux-0.01.tar.gz', 'linux-0.01.tar.gz',
size=73091, digests={'sha1': '566b6fb6365e25f47b972efa1506932b87d3ca7d'})
assert os.path.exists('downloads/linux-0.01.tar.gz')
dl = curldl.Curldl(basedir="downloads", progress=True)
dl.get("https://kernel.org/pub/linux/kernel/Historic/linux-0.01.tar.gz",
"linux-0.01.tar.gz", size=73091,
digests={"sha1": "566b6fb6365e25f47b972efa1506932b87d3ca7d"})
assert os.path.exists("downloads/linux-0.01.tar.gz")
```

If verification fails, the partial download is removed; otherwise it is renamed to the target file after being timestamped with _last-modified_ timestamp received from the server.
@@ -63,128 +48,179 @@ curldl -b downloads -s 73091 -a sha1 -d 566b6fb6365e25f47b972efa1506932b87d3ca7d
-p -l debug https://kernel.org/pub/linux/kernel/Historic/linux-0.01.tar.gz
```

The corresponding log output:
The corresponding (redacted) log output:

```text
INFO:curldl.cli:Saving download(s) to: linux-0.01.tar.gz
DEBUG:curldl.cli:Configured: Namespace(basedir='downloads', output=['linux-0.01.tar.gz'], size=73091, algo='sha1', digest='566b6fb6365e25f47b972efa1506932b87d3ca7d', progress=True, log='debug', verbose=False, url=['https://kernel.org/pub/linux/kernel/Historic/linux-0.01.tar.gz'])
INFO:curldl.util.fs:Creating directory: downloads
INFO:curldl.curldl:Downloading https://kernel.org/pub/linux/kernel/Historic/linux-0.01.tar.gz to downloads/linux-0.01.tar.gz.part
INFO:curldl.curldl:Finished downloading downloads/linux-0.01.tar.gz.part 0 -> 73,091 B (HTTPS 200: OK) [0:00:01]
DEBUG:curldl.util.fs:Timestamping downloads/linux-0.01.tar.gz.part with 1993-10-30 00:00:00+00:00
DEBUG:curldl.util.fs:Successfully verified file size of downloads/linux-0.01.tar.gz.part
DEBUG:curldl.util.crypt:Computing 160-bit SHA1 for downloads/linux-0.01.tar.gz.part
INFO:curldl.util.crypt:Successfully verified SHA1 of downloads/linux-0.01.tar.gz.part
DEBUG:curldl.curldl:Partial download of downloads/linux-0.01.tar.gz passed verification (73091 / {'sha1': '566b6fb6365e25f47b972efa1506932b87d3ca7d'})
DEBUG:curldl.curldl:Moving downloads/linux-0.01.tar.gz.part to downloads/linux-0.01.tar.gz
Saving download(s) to: linux-0.01.tar.gz
Creating directory: downloads
Downloading https://kernel.org/pub/linux/kernel/Historic/linux-0.01.tar.gz
to downloads/linux-0.01.tar.gz.part
Finished downloading downloads/linux-0.01.tar.gz.part 0 -> 73,091 B
(HTTPS 200: OK) [0:00:01]
Timestamping downloads/linux-0.01.tar.gz.part with 1993-10-30 00:00:00+00:00
Successfully verified file size of downloads/linux-0.01.tar.gz.part
Successfully verified SHA1 of downloads/linux-0.01.tar.gz.part
Partial download of downloads/linux-0.01.tar.gz passed verification
(73091 / {'sha1': '566b6fb6365e25f47b972efa1506932b87d3ca7d'})
Moving downloads/linux-0.01.tar.gz.part to downloads/linux-0.01.tar.gz
```

Note that renaming of `downloads/linux-0.01.tar.gz.part` to `downloads/linux-0.01.tar.gz` is the very last action of `Curldl.get()` method.
Note that renaming `downloads/linux-0.01.tar.gz.part` to `downloads/linux-0.01.tar.gz` is the very last action of the `Curldl.get()` method. Consequently, if the target file exists, the download succeeded and passed any requested verification.
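
As a rough illustration of the verify-then-rename order, here is a minimal standard-library sketch. It is not the actual _curldl_ implementation: the helper name `verify_and_publish` and its parameters are hypothetical, and timestamping is omitted for brevity.

```python
import hashlib
import os
import tempfile


def verify_and_publish(part_path, target_path, size, sha1_hex):
    """Hypothetical helper: rename the partial file only if size and SHA-1 match."""
    verified = os.path.getsize(part_path) == size
    if verified:
        digest = hashlib.sha1()
        with open(part_path, "rb") as stream:
            # Hash in 1 MiB chunks so that large files are never loaded at once
            for chunk in iter(lambda: stream.read(1 << 20), b""):
                digest.update(chunk)
        verified = digest.hexdigest() == sha1_hex.lower()
    if verified:
        # Renaming is the very last step, so an existing target implies success
        os.replace(part_path, target_path)
    else:
        os.remove(part_path)
    return verified


with tempfile.TemporaryDirectory() as tmpdir:
    data = b"example payload"
    expected_sha1 = hashlib.sha1(data).hexdigest()
    part = os.path.join(tmpdir, "file.bin.part")
    target = os.path.join(tmpdir, "file.bin")

    with open(part, "wb") as stream:
        stream.write(data)
    accepted = verify_and_publish(part, target, len(data), expected_sha1)
    target_exists = os.path.exists(target)

    # Same content, but a wrong expected size: the partial file is removed
    with open(part, "wb") as stream:
        stream.write(data)
    rejected = verify_and_publish(part, target, len(data) + 1, expected_sha1)
    part_removed = not os.path.exists(part)
```

Because the rename happens only after every check has passed, a caller in this sketch can treat the presence of the target file as proof of a verified download.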


## Repeated Download

Running the same command again doesn't actually result in a server request since file size matches (digest is not checked since it would be time-prohibitive when mirroring large repositories):

```text
INFO:curldl.cli:Saving download(s) to: linux-0.01.tar.gz
DEBUG:curldl.cli:Configured: Namespace(basedir='downloads', output=['linux-0.01.tar.gz'], size=73091, algo='sha1', digest='566b6fb6365e25f47b972efa1506932b87d3ca7d', progress=True, log='debug', verbose=False, url=['https://kernel.org/pub/linux/kernel/Historic/linux-0.01.tar.gz'])
DEBUG:curldl.curldl:Skipping update of downloads/linux-0.01.tar.gz since it has the expected size 73,091 B
Saving download(s) to: linux-0.01.tar.gz
Skipping update of downloads/linux-0.01.tar.gz since it has
the expected size 73,091 B
```

We can also request the same file without providing an expected size:

```shell
curldl -b downloads -p -l debug ftp://ftp.hosteurope.de/mirror/ftp.kernel.org/pub/linux/kernel/Historic/linux-0.01.tar.gz
```python
import curldl
dl = curldl.Curldl(basedir="downloads", progress=True)
dl.get("ftp://ftp.hosteurope.de/mirror/ftp.kernel.org/pub/linux/kernel/Historic/linux-0.01.tar.gz",
"linux-0.01.tar.gz")
```

In this case, the download is skipped due to _if-modified-since_ check:
In this case, the download is skipped due to the _If-Modified-Since_ check:

```text
INFO:curldl.cli:Saving download(s) to: linux-0.01.tar.gz
DEBUG:curldl.cli:Configured: Namespace(basedir='downloads', output=['linux-0.01.tar.gz'], size=None, algo='sha256', digest=None, progress=True, log='debug', verbose=False, url=['ftp://ftp.hosteurope.de/mirror/ftp.kernel.org/pub/linux/kernel/Historic/linux-0.01.tar.gz'])
INFO:curldl.curldl:Downloading ftp://ftp.hosteurope.de/mirror/ftp.kernel.org/pub/linux/kernel/Historic/linux-0.01.tar.gz to downloads/linux-0.01.tar.gz.part
DEBUG:curldl.curldl:Will update downloads/linux-0.01.tar.gz.part if modified since 1993-10-30 00:00:00+00:00
INFO:curldl.curldl:Discarding downloads/linux-0.01.tar.gz.part because it is not more recent
DEBUG:curldl.curldl:Removing downloads/linux-0.01.tar.gz.part since size of 0 B is below threshold or removal requested
Will update downloads/linux-0.01.tar.gz.part
if modified since 1993-10-30 00:00:00+00:00
Discarding downloads/linux-0.01.tar.gz.part because
it is not more recent
```

Note that FTP protocol was used this time — _curldl_ is protocol-agnostic when using the underlying _libcurl_ functionality.
Note that FTP protocol was used this time — _curldl_ is entirely protocol-agnostic when using the underlying _libcurl_ functionality.


## Resuming Download

If a download is interrupted, it will be resumed on the next attempt (which may also be a retry according to the configured retry policy). Here is what happens when _Ctrl-C_ is used to send SIGINT signal to the Python process:
If a download is interrupted, it will be resumed on the next attempt (which may also be a retry according to the configured retry policy). Here is what happens when _Ctrl-C_ is used to send a SIGINT signal to the Python process. This example also demonstrates how to construct a filename from a URL (the CLI does the same when the `--output` switch is omitted).

```shell
curldl -b downloads -p https://releases.ubuntu.com/22.04.2/ubuntu-22.04.2-live-server-amd64.iso
```python
import curldl, os, urllib.parse as uparse
dl = curldl.Curldl(basedir="downloads", progress=True)
url = "https://releases.ubuntu.com/22.04.2/ubuntu-22.04.2-live-server-amd64.iso"
filename = os.path.basename(uparse.unquote(uparse.urlparse(url).path))
dl.get(url, filename)
```

The corresponding (redacted) log output:

```text
INFO:curldl.cli:Saving download(s) to: ubuntu-22.04.2-live-server-amd64.iso
INFO:curldl.curldl:Downloading https://releases.ubuntu.com/22.04.2/ubuntu-22.04.2-live-server-amd64.iso to downloads/ubuntu-22.04.2-live-server-amd64.iso.part
ubuntu-22.04.2-live-server-amd64.iso: 13%|██▋ | 244M/1.84G [00:06<00:38, 44.7MB/s]^CCRITICAL:curldl.util.log:KeyboardInterrupt:
ERROR:curldl.curldl:Download interrupted downloads/ubuntu-22.04.2-live-server-amd64.iso.part 0 -> 259,981,312 B (42: Callback aborted / HTTPS 200: OK) [0:00:07]
CRITICAL:curldl.util.log:error: (42, 'Callback aborted')
Downloading https://releases.ubuntu.com/22.04.2/ubuntu-22.04.2-live-server-amd64.iso
to downloads/ubuntu-22.04.2-live-server-amd64.iso.part
ubuntu-22.04.2-live-server-amd64.iso: 13%|██▋ | 244M/1.84G
[00:06<00:38, 44.7MB/s] ^C KeyboardInterrupt:
Download interrupted downloads/ubuntu-22.04.2-live-server-amd64.iso.part 0 -> 259,981,312 B
(42: Callback aborted / HTTPS 200: OK) [0:00:07]
```

Attempting the download again resumes the download:
Attempting the same download again resumes the download:

```text
INFO:curldl.cli:Saving download(s) to: ubuntu-22.04.2-live-server-amd64.iso
INFO:curldl.curldl:Resuming download of https://releases.ubuntu.com/22.04.2/ubuntu-22.04.2-live-server-amd64.iso to downloads/ubuntu-22.04.2-live-server-amd64.iso.part at 259,981,312 B
INFO:curldl.curldl:Finished downloading downloads/ubuntu-22.04.2-live-server-amd64.iso.part 259,981,312 -> 1,975,971,840 B (HTTPS 206: Partial Content) [0:01:24]
Resuming download of https://releases.ubuntu.com/22.04.2/ubuntu-22.04.2-live-server-amd64.iso
to downloads/ubuntu-22.04.2-live-server-amd64.iso.part at 259,981,312 B
Finished downloading downloads/ubuntu-22.04.2-live-server-amd64.iso.part
259,981,312 -> 1,975,971,840 B (HTTPS 206: Partial Content) [0:01:24]
```

Note, however, that we didn't provide a size or digest for verification. Since the downloaded file is timestamped only once the download completes, how does _curldl_ know that the file wasn't changed on the server in the meantime? The answer is that _curldl_ simply avoids removing large partial downloads in such cases; see the documentation for the _always_keep_part_bytes_ constructor parameter of _Curldl_.
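
The `HTTPS 206: Partial Content` status in the resumed download log corresponds to an HTTP byte-range request. The following sketch shows only that underlying HTTP mechanism using the standard library; _curldl_ itself delegates resumption to _libcurl_, and the helper name `build_resume_request` and the example URL are hypothetical. The request is built but never sent.

```python
import os
import tempfile
import urllib.request


def build_resume_request(url, part_path):
    """Hypothetical helper: build a request that continues after the bytes on disk."""
    offset = os.path.getsize(part_path) if os.path.exists(part_path) else 0
    request = urllib.request.Request(url)
    if offset > 0:
        # Ask only for the remaining bytes; a 206 response means the server complied
        request.add_header("Range", f"bytes={offset}-")
    return request


with tempfile.TemporaryDirectory() as tmpdir:
    part = os.path.join(tmpdir, "example.iso.part")
    with open(part, "wb") as stream:
        stream.write(b"\0" * 1024)  # pretend that 1 KiB was already downloaded

    resumed = build_resume_request("https://example.com/example.iso", part)
    fresh = build_resume_request("https://example.com/example.iso", part + ".missing")
```

A server that ignores the range replies with a regular `200` response and the full content, in which case the partial file has to be overwritten rather than appended to.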


## Enabling Additional Protocols

By default, _curldl_ enables the following protocols:

- HTTP(S)
- FTP(S)
- SFTP

In order to enable a different set of protocols, use the `allowed_protocols_bitmask` constructor argument. For instance, the code below downloads a _file://_ URI:

```python
import curldl, pycurl, pathlib
protocols = pycurl.PROTO_FILE | pycurl.PROTO_HTTPS
dl = curldl.Curldl(basedir="downloads", allowed_protocols_bitmask=protocols)
file_uri = pathlib.Path(__file__).absolute().as_uri()
dl.get(file_uri, "current_source.py")
```

Note, however, that we didn't provide a size or digest for verification. Since the downloaded file is timestamped only once download completes, how does _curldl_ know that the file wasn't changed on the server in the meantime? The answer is that _curldl_ simply avoids removing large partial downloads in such cases — see inline documentation for _always_keep_part_bytes_ constructor parameter of _Curldl_.
To enable all protocols, use `allowed_protocols_bitmask=pycurl.PROTO_ALL`. Note, however, that allowing every protocol may have security repercussions.
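
For readers unfamiliar with bitmask allowlists, the sketch below shows how combining flags with `|` and testing them with `&` works. The `Proto` class is a hypothetical stand-in for the `pycurl.PROTO_*` constants, with illustrative values only, and `is_allowed` is not part of _curldl_.

```python
import enum


class Proto(enum.IntFlag):
    """Hypothetical stand-in for pycurl.PROTO_* constants (illustrative values)."""

    HTTP = enum.auto()
    HTTPS = enum.auto()
    FTP = enum.auto()
    FILE = enum.auto()


def is_allowed(scheme, allowed):
    """Return True if the flag matching the URL scheme is set in the bitmask."""
    flag = Proto.__members__.get(scheme.upper())
    return flag is not None and bool(allowed & flag)


# Same shape as the constructor argument above: individual flags joined with "|"
allowed = Proto.FILE | Proto.HTTPS
```

With this mask, `is_allowed("file", allowed)` and `is_allowed("https", allowed)` hold, while `ftp` and any scheme unknown to the enumeration are rejected.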


## Escaping Base Directory

Attempts to escape base directory are prevented, e.g.:

```shell
curldl --basedir . http://example.com/ --output ../file.txt
```python
import curldl, os
dl = curldl.Curldl(basedir=os.curdir)
dl.get("http://example.com/", os.path.join(os.pardir, "file.txt"))
```

The above results in:

```text
CRITICAL:curldl.util.log:ValueError: Relative path ../file.txt escapes base path /home/user/curldl
ValueError: Relative path ../file.txt escapes base path /home/user/curldl
```

_curldl_ performs extensive checks to prevent escaping the base download directory — see _FileSystem_ class implementation and unit tests for details.
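
The general idea behind such a check can be sketched with the standard library as follows. This is a simplified assumption about the approach, not the _FileSystem_ implementation, and the helper name `resolve_inside` is hypothetical.

```python
import os
import tempfile


def resolve_inside(basedir, rel_path):
    """Hypothetical helper: resolve rel_path under basedir or raise ValueError."""
    base = os.path.realpath(basedir)
    # realpath() collapses ".." components and follows existing symlinks
    target = os.path.realpath(os.path.join(base, rel_path))
    if os.path.commonpath([base, target]) != base:
        raise ValueError(f"Relative path {rel_path} escapes base path {base}")
    return target


with tempfile.TemporaryDirectory() as tmpdir:
    inside = resolve_inside(tmpdir, "file.txt")
    try:
        resolve_inside(tmpdir, os.path.join(os.pardir, "file.txt"))
        escaped = False
    except ValueError:
        escaped = True
```

Comparing resolved paths instead of raw strings is what makes the check robust against `..` components and symbolic links in this sketch.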


# Installation

The only requirement for _curldl_ is Python 3.8+. Install the package as follows:
```shell
pip install curldl
```

_curldl_ performs rather extensive checks to prevent base directory escaping — see _FileSystem_ class implementation and unit tests for details.
If you encounter a build failure during installation of _pycurl_ dependency, the following should help:
* On Linux, install one of:
* _pycurl_ package from distribution repo — e.g., on Ubuntu run `sudo apt install python3-pycurl`
* _libcurl_ development files with `sudo apt install build-essential libcurl4-openssl-dev`
* On Windows, install an unofficial _pycurl_ build since official builds are not available at the moment — e.g., from [Christoph Gohlke](https://www.lfd.uci.edu/~gohlke/pythonlibs/#pycurl)'s packages, or use _Conda_ (see below).
* On Windows and macOS, use _Conda_ or _Miniconda_ with [conda-forge](https://conda-forge.org/) channel. For instance, see runtime dependencies in the following [test environment](https://github.com/noexec/curldl/blob/develop/misc/conda/test-environment.yml).

Overall, _curldl_ is expected to not have any issues in any environment with Python 3.8+ (CPython or PyPy) — see the [Testing](#testing) section below.


# Testing

A simplified configuration matrix covered by [CI/CD test + build pipeline](https://github.com/noexec/curldl/actions/workflows/ci.yml) at the time of writing this document is presented below:

| Platform | CPython 3.8 | PyPy 3.8 | PyPy 3.9 | CPython 3.9 | CPython 3.10 | CPython 3.11 |
|-------------|-----------------------|----------|----------|-------------|----------------|--------------|
| Ubuntu-x64 | venv, conda, platform | venv | venv | venv | venv, platform | venv, conda |
| Windows-x64 | venv, conda | | | | | venv, conda |
| Windows-x86 | venv | | | | | venv |
| macOS-x64 | conda | | | | | conda |
| Platform | CPython 3.8 | CPython 3.9, PyPy 3.8–3.9 | CPython 3.10 | CPython 3.11 |
|-------------|-----------------------|----------------------------|----------------|--------------|
| Ubuntu-x64 | venv, conda, platform | venv | venv, platform | venv, conda |
| Windows-x64 | venv, conda | | | venv, conda |
| Windows-x86 | venv | | | venv |
| macOS-x64 | conda | | | conda |

In the table:
* _venv_ — virtual environment with all package dependencies and [editable package install](https://pip.pypa.io/en/stable/topics/local-project-installs/); on Ubuntu includes tests with minimal versions of package dependencies;
* _conda_ — _Miniconda_ with package dependencies installed from _mini-forge_ channel, and _curldl_ as editable package install;
* _platform_ — as many dependencies as possible satisfied via Ubuntu package repository, and _curldl_ as _wheel_ install.

The CI/CD pipeline succeeds only if _curldl_ package successfully builds and passes all the [pytest](https://pytest.org/) test cases with 100% [code coverage](https://coverage.readthedocs.io/), as well as [Pylint](https://pylint.readthedocs.io/), [Mypy](https://mypy-lang.org/) and [Bandit](https://bandit.readthedocs.io/) static code analysis. Note that the testing code is also covered by these restrictions.
The CI/CD pipeline succeeds only if _curldl_ package successfully builds and passes all the [pytest](https://pytest.org/) test cases with 100% [code coverage](https://coverage.readthedocs.io/), as well as [Pylint](https://pylint.readthedocs.io/), [Mypy](https://mypy-lang.org/) and [Bandit](https://bandit.readthedocs.io/) static code analysis. Code style checks are also a part of the pipeline. Note that the testing code is also covered by these restrictions.

In order to run tests locally with Python interpreter available in the system, install the _venv_ environment and run _pytest_ with static code analysis, code coverage and security checks as follows:
In order to run tests locally with the Python interpreter installed on the system, install the _venv_ environment and run _pytest_ with static code analysis and code coverage:
```shell
./venv.sh install-venv
./venv.sh pytest
./venv.sh misc/scripts/run-bandit.sh
```

`venv.sh` is a convenience _venv_ wrapper that also enables some additional Python checks; you can simply activate the _venv_ environment instead. Testing with _Conda_ is possible as well — see the [CI/CD pipeline execution](https://github.com/noexec/curldl/actions) for details.


# Changelog

See [Changelog](https://github.com/noexec/curldl/blob/develop/docs/CHANGELOG.md) file for a summary of changes in each release.
See the [Changelog](https://github.com/noexec/curldl/blob/develop/docs/CHANGELOG.md) file for a summary of changes in each release.


# License
1 change: 1 addition & 0 deletions docs/changelog.d/54.doc.txt
@@ -0,0 +1 @@
Extend package usage documentation
1 change: 1 addition & 0 deletions pyproject.toml
@@ -20,6 +20,7 @@ classifiers = [
"Programming Language :: Python",
"Programming Language :: Python :: 3 :: Only",
"Programming Language :: Python :: 3.8",
"Programming Language :: Python :: 3.11",
"Topic :: Software Development :: Libraries :: Python Modules",
"Topic :: Internet :: WWW/HTTP",
"Topic :: Internet :: File Transfer Protocol (FTP)",