Download multiple urls with download timeout #703

vodkaslime · 2024-09-27T11:36:10Z

I'm trying to download multiple URLs with a download timeout.

I can download URLs one at a time with fetch_url and a download timeout set through the config (I'm not sure whether this is the best practice for setting a timeout):

from trafilatura import fetch_url
from trafilatura.settings import use_config

config = use_config()
config.set("DEFAULT", "DOWNLOAD_TIMEOUT", "5")
downloaded = fetch_url(url, config=config)
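
For context, here is a minimal sketch of what the config.set call above does, assuming the object returned by use_config() behaves like a standard configparser.ConfigParser. It uses only the standard library, so the bare ConfigParser() below is a stand-in for the real trafilatura config, not the library's own loading code.

```python
from configparser import ConfigParser

# Stand-in for the object returned by use_config() (assumption: it is a
# regular ConfigParser, so the same methods apply).
config = ConfigParser()

# ConfigParser only accepts string values, hence "5" rather than 5.
config.set("DEFAULT", "DOWNLOAD_TIMEOUT", "5")

# The value is converted back to an integer when it is read.
timeout = config.getint("DEFAULT", "DOWNLOAD_TIMEOUT")
print(timeout, type(timeout).__name__)
```

Passing an integer to config.set would raise a TypeError, which is why the timeout is written as a string.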

However, when following the tutorial at https://trafilatura.readthedocs.io/en/latest/downloads.html:

from trafilatura.downloads import add_to_compressed_dict, buffered_downloads, load_download_buffer

# list of URLs
mylist = ['https://www.example.org', 'https://www.httpbin.org/html']
# number of threads to use
threads = 4

# convert the input list to an internal format
url_store = add_to_compressed_dict(mylist)
# processing loop
while url_store.done is False:
    bufferlist, url_store = load_download_buffer(url_store, sleep_time=5)
    # process downloads
    for url, result in buffered_downloads(bufferlist, threads):
        # do something here
        print(url)
        print(result)

I'm not sure how to add DOWNLOAD_TIMEOUT to each connection in this code. It would be great if anyone could help out.

Thanks

adbar · 2024-10-01T10:57:35Z

Hi @vodkaslime, indeed, it is not currently possible to pass a suitable argument to buffered_downloads: there is a missing link between the config (older) and options (newer) formats.

The code and the docs are both impacted and both need to be updated.
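
As a possible stopgap while that link is missing, threaded downloads with a timeout can be sketched with the Python standard library alone. This is not trafilatura's API: fetch_with_timeout and download_all are hypothetical names introduced for illustration, and the sketch drops the throttling that load_download_buffer applies through sleep_time, so it is only reasonable for a handful of URLs on different hosts.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from http.client import HTTPException
from urllib.request import urlopen


def fetch_with_timeout(url, timeout=5):
    """Return the decoded body of url, or None if the request fails.

    Hypothetical helper. The timeout applies to each blocking socket
    operation (connect, read), not to the download as a whole.
    """
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except (OSError, ValueError, HTTPException):
        # OSError covers URLError, HTTPError and socket timeouts;
        # ValueError covers malformed URLs.
        return None


def download_all(urls, threads=4, timeout=5, fetch=fetch_with_timeout):
    """Yield (url, result) pairs as the downloads complete.

    Hypothetical helper. fetch can be swapped for any callable taking
    (url, timeout), for example a wrapper around another download function.
    """
    with ThreadPoolExecutor(max_workers=threads) as executor:
        futures = {executor.submit(fetch, url, timeout): url for url in urls}
        for future in as_completed(futures):
            yield futures[future], future.result()
```

It could be used much like the loop in the tutorial, for example `for url, result in download_all(mylist, threads=4, timeout=5): print(url, result)`, with results arriving in completion order rather than input order.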

adbar · 2024-11-07T17:49:49Z

@vodkaslime The PR above fixes the issue.

A description could be added to the docs.

adbar added the enhancement and documentation labels on Oct 1, 2024

This was referenced Oct 30, 2024

CLI downloads: make sure all user-specified options are used #732

Closed

Downloads: fully use information from both config and options variables #733

Closed

adbar removed the enhancement label on Nov 7, 2024

vodkaslime closed this as completed Nov 12, 2024
