prepare v0.9.1 (#23)
* prepare v0.9.1

* Readme wording
adbar authored Apr 24, 2023
1 parent 57b5fae commit a144749
Showing 3 changed files with 17 additions and 1 deletion.
8 changes: 8 additions & 0 deletions HISTORY.md
@@ -1,6 +1,14 @@
## History / Changelog


### 0.9.1

- network tests: larger throughput
- UrlStore: optional compression of rules (#21), added `reset()` (#22) and `get_all_counts()` methods
- UrlStore fixes: `signal` in #18, `total_url_number`
- updated Readme


### 0.9.0

- hardening of filters and URL parsing (#14)
8 changes: 8 additions & 0 deletions README.rst
@@ -255,6 +255,7 @@ The ``UrlStore`` class allows for storing and retrieving domain-classified URLs,
- ``dump_urls()``: Return a list of all known URLs.
- ``print_urls()``: Print all URLs in store (URL + TAB + visited or not).
- ``print_unvisited_urls()``: Print all unvisited URLs in store.
- ``get_all_counts()``: Return all download counts for the hosts in store.
- ``get_known_domains()``: Return all known domains as a list.
- ``total_url_number()``: Return the total number of URLs in store.
- ``is_known(url)``: Check if the given URL has already been stored.
@@ -263,6 +264,7 @@ The ``UrlStore`` class allows for storing and retrieving domain-classified URLs,
- ``filter_unvisited_urls(urls)``: Take a list of URLs and return the currently unvisited ones.
- ``find_known_urls(domain)``: Get all already known URLs for the given domain (e.g. "https://example.org").
- ``find_unvisited_urls(domain)``: Get all unvisited URLs for the given domain.
- ``reset()``: Re-initialize the URL store.
- Crawling and downloads
- ``get_url(domain)``: Retrieve a single URL and mark it as visited (with a corresponding timestamp).
- ``get_rules(domain)``: Return the stored crawling rules for the given website.
@@ -273,6 +275,12 @@ The ``UrlStore`` class allows for storing and retrieving domain-classified URLs,
- ``unvisited_websites_number()``: Return the number of websites for which there are still URLs to visit.
- ``is_exhausted_domain(domain)``: Tell if all known URLs for the website have been visited.

Optional settings:
- ``compressed=True``: activate compression of URLs and rules
- ``language=XX``: focus on a particular target language (two-letter code)
- ``strict=True``: stricter URL filtering
- ``verbose=True``: dump URLs if interrupted (requires use of ``signal``)


Command-line
------------
2 changes: 1 addition & 1 deletion courlan/__init__.py
@@ -8,7 +8,7 @@
__author__ = "Adrien Barbaresi"
__license__ = "GNU GPL v3+"
__copyright__ = "Copyright 2020-2023, Adrien Barbaresi"
__version__ = "0.9.0"
__version__ = "0.9.1"


# imports
