not able to download some files #53

Open
francocatalano opened this issue Oct 18, 2024 · 5 comments

@francocatalano

Hi,
I have created the following query:
esgpull add project:CMIP6 experiment_id:historical,ssp126,ssp245,ssp585 member_id:r1i1p1f1 source_id:EC-Earth3,MPI-ESM1-2-HR table_id:day variable_id:pr,tasmax,tasmin --track
then:
esgpull update
esgpull download
Some files downloaded successfully while others did not. I've retried many times with:
esgpull retry
esgpull download
but for 69 files I always end up with the following error (from logfile):
httpcore.ConnectError: All connection attempts failed

My configuration is the following:
[download]
chunk_size = 67108864
http_timeout = 20
max_concurrent = 5
disable_ssl = false
disable_checksum = false

[api]
index_node = "esgf-data.dkrz.de"
http_timeout = 20
max_concurrent = 5
page_limit = 50

[api.default_options]
distrib = "true"
latest = "none"
replica = "none"
retracted = "false"

I've tried switching to different index nodes, but it did not help.
The corresponding files are actually available on the nodes, since I can get them from the ESGF web interface through HTTP Download.

Any suggestions?
Thanks
Franco

@AtefBN
Collaborator

AtefBN commented Nov 4, 2024

Hi Franco,

Sorry for the late reply. What you are experiencing is a classic symptom of a few ESGF datanodes not responding (either temporarily or permanently). This is not the end of the world: luckily, most datasets (and files) are replicated across multiple datanodes, so the steps to follow are:

  1. Identify which datanodes are the culprits. They can be found in the log file that the download failure message points to.
  2. Ask esgpull to reset the links between the targeted datanodes and your query. This can be done via the Python console and the esgpull CLI:
from esgpull import Esgpull
from esgpull.models import FileStatus

esg = Esgpull("/your/esgpull/home/dir")
query = esg.graph.get("#####")  # insert your query SHA here
# list all the files in your query that were not downloaded
missing_files = [f for f in query.files if f.status != FileStatus.Done]
esg.db.delete(*missing_files)  # reset their database entries
  3. Now reconfigure your esgpull query to avoid these non-responding datanodes:

esgpull add project:CMIP6 experiment_id:historical,ssp126,ssp245,ssp585 member_id:r1i1p1f1 source_id:EC-Earth3,MPI-ESM1-2-HR table_id:day variable_id:pr,tasmax,tasmin \!data_node:<insert bad datanode(s) here> --track
Make sure you use a backslash "\" to escape the ! character on your command line. Like all other facets, the datanode list is a comma-separated string (for example dpesgf03.nccs.nasa.gov,esg-dn2.nsc.liu.se).

  4. Update your new query (it should have a different SHA because of the added argument) and restart the download.
  5. Rinse and repeat until all your files are downloaded. In my experience this should not require more than one pass.
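
For step 1, a small script can help tally which hosts show up most often on failing log lines. This is only a minimal sketch: it assumes that a failing line carries both the error name and the file URL, which may not match the actual esgpull log layout, and the example.org hosts are made up for illustration.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Assumption: a failing log line mentions the error name and the file URL
# on the same line. Adjust ERROR_MARKERS to whatever your log file contains.
ERROR_MARKERS = ("ConnectError", "HTTPStatusError")
URL_PATTERN = re.compile(r"https?://[^\s'\"<>)]+")


def count_failing_hosts(log_lines):
    """Count how often each host appears on lines containing an error marker."""
    hosts = Counter()
    for line in log_lines:
        if not any(marker in line for marker in ERROR_MARKERS):
            continue
        for url in URL_PATTERN.findall(line):
            host = urlparse(url).hostname
            if host:
                hosts[host] += 1
    return hosts


# Hypothetical log lines, for illustration only.
sample = [
    "ConnectError('All connection attempts failed') https://bad-node.example.org/thredds/a.nc",
    "ConnectError('All connection attempts failed') https://bad-node.example.org/thredds/b.nc",
    "download ok https://good-node.example.org/thredds/c.nc",
]
print(count_failing_hosts(sample).most_common())
```

To use it on a real log, pass an open file object instead of the sample list. The hosts with the highest counts are the candidates for the data_node exclusion.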

This behaviour occurs because esgpull has no a priori knowledge of the state of the datanodes listed in the ESGF catalogues, so for each file one of the listed datanodes is picked at random.
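
To make the effect of the exclusion concrete, here is a toy model of a random pick among replicas. It is not esgpull's actual code: the function name, the node names and the selection logic are all assumptions made for illustration.

```python
import random


def pick_data_node(replicas, excluded=(), rng=random):
    """Pick one data node at random among the replicas that are not excluded."""
    candidates = [node for node in replicas if node not in excluded]
    if not candidates:
        raise ValueError("no data node left after exclusions")
    return rng.choice(candidates)


# Hypothetical replica list for a single file.
replicas = ["node-a.example.org", "node-b.example.org", "node-c.example.org"]
rng = random.Random(0)  # seeded so that repeated runs behave the same

# Without exclusions, any of the three nodes may be picked, including a dead one.
print(pick_data_node(replicas, rng=rng))

# With the unresponsive node excluded, it can no longer be selected.
print(pick_data_node(replicas, excluded={"node-b.example.org"}, rng=rng))
```

In this model a file that lands on an unresponsive node stays stuck until the node is excluded, which is what the !data_node facet achieves on the query side.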

Do let me know if this solves your issue and sorry again for the delay

@francocatalano
Author

Hi Atef,
Thank you for your reply and detailed explanation.
In the meantime, since I needed to get my hands on the data quickly, I proceeded manually with the wget download script generated by the ESGF datanode website.
I took note of your suggestion and, if I encounter a similar problem in the future, I'll try your solution and let you know how it goes.
Regards,
Franco

@hanjunkim0617

hanjunkim0617 commented Feb 7, 2025

Hello @AtefBN

I have a similar question, but I'm using a Synda selection file for the download.

I'm trying to download the data by converting the Synda selection file as shown below,
but some files are not downloaded.

(esgpull) hk764@g2-login-02:~/esgpull2$ cat daily1.txt

project=CMIP6
experiment_id=historical hist-aer hist-GHG
source_id=ACCESS-CM2 ACCESS-ESM1-5 BCC-CSM2-MR CESM2 CanESM5 FGOALS-g3 HadGEM3-GC31-LL MIROC6 MPI-ESM1-2-LR MRI-ESM2-0 NorESM2-LM CNRM-CM6-1 E3SM-2-0 GFDL-ESM4 IPSL-CM6A-LR
member_id=r1i1p1f1
table_id=day
variable_id=ta wap pr tas huss

(esgpull) hk764@g2-login-02:/esgpull2$ esgpull convert daily1.txt -o daily1.yaml
(esgpull) hk764@g2-login-02:/esgpull2$ esgpull add -q daily1.yaml
(esgpull) hk764@g2-login-02:/esgpull2$ esgpull update 2f3e10
(esgpull) hk764@g2-login-02:/esgpull2$ esgpull download
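
For reference, the mapping between the selection file and esgpull's facet syntax can be sketched in a few lines of Python. This is not how esgpull convert is implemented: the function name is hypothetical and the sketch only handles plain key=value lines with space-separated values, as in the file above.

```python
def selection_to_facets(text):
    """Turn 'key=v1 v2' lines into 'key:v1,v2' facet strings."""
    facets = []
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines, comments and anything that is not key=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, values = line.partition("=")
        facets.append(f"{key.strip()}:{','.join(values.split())}")
    return facets


selection = """
project=CMIP6
experiment_id=historical hist-aer hist-GHG
member_id=r1i1p1f1
table_id=day
variable_id=ta wap pr tas huss
"""
print(" ".join(selection_to_facets(selection)))
# project:CMIP6 experiment_id:historical,hist-aer,hist-GHG member_id:r1i1p1f1 table_id:day variable_id:ta,wap,pr,tas,huss
```

The printed facets are in the same key:value,value form that esgpull add and esgpull search take on the command line.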

However, only 1.3 TiB of 2.0 TiB gets downloaded, even when I run "retry" and "download" again and again.

In the error logs, the following two errors occur for many URLs:

ConnectError('All connection attempts failed')
HTTPStatusError("Client error '404 Not Found')

I think I could try downloading replicas of the datasets, but I don't know how to do that.
Could you help me download all the files found by the search?

For your information, below is my configuration file.

[paths]
auth = "/home/hk764/esgpull2/auth"
data = "/share/cliprelabs/hk764/data_esgpull"
db = "/home/hk764/esgpull2/db"
log = "/home/hk764/esgpull2/log"
tmp = "/share/cliprelabs/hk764/data_esgpull/tmp"

[credentials]
filename = "credentials.toml"

[cli]
page_size = 20

[db]
filename = "esgpull.db"

[download]
chunk_size = 67108864
http_timeout = 20
max_concurrent = 5
disable_ssl = true
disable_checksum = false
show_filename = true

[api]
index_node = "esgf.ceda.ac.uk"
http_timeout = 20
max_concurrent = 5
page_limit = 50
default_query_id = ""

[api.default_options]
distrib = "true"
latest = "true"
replica = "none"
retracted = "false"

I wonder whether changing 'replica' in api.default_options would enable downloading replicas (I miss "synda replica next").

Thank you so much for developing a wonderful program!
Best regards,
Hanjun

@AtefBN
Collaborator

AtefBN commented Feb 7, 2025

Hi @hanjunkim0617, it seems that whichever datanodes esgpull randomly selected from the list of those available for these files are behaving badly. I have run into a few of these with larger queries, and sometimes waiting a few days and then retrying the download works.

For the time being you need to:

  1. identify the datanode that is causing the issue from the log files,
  2. flush these error files from the database,
  3. then build a new query while specifically asking esgpull to avoid that datanode(s).

For 1, just check the log files related to the download job. Note that 503 errors are sometimes harder to debug.
For 2, start a Python shell in your environment and run these lines: https://gist.github.com/AtefBN/0293975cb7c57f12dd15e0c2029872b5
For 3, the easiest way is to run:
esgpull add -r 2f3e10 \!data_node:<insert datanode(s) you want to avoid here> --track
Then, of course, update the new query and download. This builds a new query that inherits everything from your original one and adds extra criteria on top, here the exclusion of the datanode(s).
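
If you would rather not type the escaped facet by hand, the command for step 3 can be assembled with Python's shlex module. This is just a sketch: the helper name is made up, and the two hosts are only the example names mentioned earlier in this thread. Note that shlex.quote wraps the facet in single quotes, which is an alternative to the backslash for protecting the ! character.

```python
import shlex


def build_exclusion_command(query_sha, bad_nodes):
    """Return an 'esgpull add' command line that excludes the given data nodes."""
    if not bad_nodes:
        raise ValueError("at least one data node is required")
    facet = "!data_node:" + ",".join(bad_nodes)
    args = ["esgpull", "add", "-r", query_sha, facet, "--track"]
    # shlex.quote single-quotes any argument containing shell-special characters.
    return " ".join(shlex.quote(arg) for arg in args)


print(build_exclusion_command("2f3e10", ["dpesgf03.nccs.nasa.gov", "esg-dn2.nsc.liu.se"]))
# esgpull add -r 2f3e10 '!data_node:dpesgf03.nccs.nasa.gov,esg-dn2.nsc.liu.se' --track
```

The printed line can be pasted into the shell as is.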

P.S. You can see which datanodes are available for a query by using the --hints flag, for example:
esgpull search project:CMIP6 experiment_id:historical,hist-aer,hist-GHG source_id:ACCESS-CM2,ACCESS-ESM1-5,BCC-CSM2-MR,CESM2,CanESM5,FGOALS-g3,HadGEM3-GC31-LL,MIROC6,MPI-ESM1-2-LR,MRI-ESM2-0,NorESM2-LM,CNRM-CM6-1,E3SM-2-0,GFDL-ESM4,IPSL-CM6A-LR variant_label:r1i1p1f1 table_id:day variable_id:ta,wap,pr,tas,huss --hints data_node

This works for all facets, and it can also help you target a specific datanode if you know it is reliable and performs well.

Hope this helps.

@hanjunkim0617

Dear @AtefBN

Thanks so much for the quick reply!
I tried the method you suggested and was able to download the remaining data.

I have two additional questions.

  1. Are there any differences between "esgpull remove" and your python code?
  2. What is the difference between the True and False for the option "replica"?

Thanks again for all of your help!!
Best regards,
Hanjun
