Downloading genomes from GenBank using FTP runs into ftplib.error_perm #21
Comments
After removing everything and re-running, we get the following error: Traceback (most recent call last): It looks like it will be hard to track down why this is happening.
Running the same command three more times results in the EOFError when downloading the same genome: Coprobacillus_sp._AF13-4LB. Perhaps there is a pattern after all. Now trying to track down why this happens.
It looks like I was mistaken: the error is not for Coprobacillus_sp._AF13-4LB, which downloads correctly without any issues. The problem is with the next genome in the list, Streptomyces_sp._SID7817, whose genomic.fna file is not downloaded. @Omar-HeshamR, if you are investigating Coprobacillus_sp._AF13-4LB, you may want to skip that for now. The correct way to go about it is probably to investigate the error message itself first. The getline() function in ftplib.py documents that an EOFError occurs from a closed connection. I don't think there is much to do about a closed connection other than simply skipping this genome.
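A minimal sketch of the "skip this genome" idea, not the code in get_reference_genomes.py. The helper name `download_or_skip` and its parameters are hypothetical; `ftp` is assumed to be an already connected `ftplib.FTP` object.

```python
import ftplib
import os


def download_or_skip(ftp, remote_name, local_path):
    """Download one file; return True on success, False if it was skipped.

    Hypothetical helper: `ftp` is assumed to be a connected ftplib.FTP.
    """
    try:
        with open(local_path, "wb") as out:
            ftp.retrbinary(f"RETR {remote_name}", out.write)
        return True
    except ftplib.all_errors as exc:
        # ftplib.all_errors covers ftplib.Error, OSError and EOFError,
        # so a closed connection (EOFError) lands here as well.
        if os.path.exists(local_path):
            os.remove(local_path)  # do not keep a partial file
        print(f"Skipping {remote_name}: {exc!r}")
        return False
```

After an EOFError the connection object is no longer usable, so a caller would still need a fresh connection before trying the next genome.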
I tested with Streptomyces_sp._SID7817 and it worked completely fine using the same code, so again I think the error is independent of the genome itself and has to do with the ftplib connection. Still investigating.
I think you are correct. I am leaving this for tonight and will look into it again tomorrow, but I guess we are too optimistic in assuming that the FTP connection will stay alive for 500/1000 genomes. Probably the connection just closes itself after a number of downloads. We could try to re-establish the connection periodically, or use multiple threads, each handling a limited number of genomes. I'm still not sure which would be the best way to go about it. Using multiple threads may make it faster, but it also adds the complexity of dividing the work and coordinating among the threads.
Yes, I am going to start by resetting the connection periodically to see whether that is the root cause of the problem, then I will look into using multiple threads to make it faster.
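A rough sketch of the periodic-reset idea, under stated assumptions. `HOST` is assumed to be the NCBI FTP server, and `open_connection`, `close_quietly`, `download_all`, `fetch_one`, and the batch size of 10 are all hypothetical names and values, not taken from the script.

```python
import ftplib

HOST = "ftp.ncbi.nlm.nih.gov"  # assumption: the server the script talks to
RECONNECT_EVERY = 10           # assumption: small enough to avoid a drop


def open_connection():
    """Open a fresh anonymous FTP session."""
    ftp = ftplib.FTP(HOST, timeout=60)
    ftp.login()
    return ftp


def close_quietly(ftp):
    """Close a session even if the server has already dropped it."""
    try:
        ftp.quit()
    except Exception:
        ftp.close()


def download_all(tasks, fetch_one, connect=open_connection,
                 reconnect_every=RECONNECT_EVERY):
    """Call fetch_one(ftp, task) for each task; return the tasks that failed.

    The connection is replaced every `reconnect_every` tasks and also
    right after any FTP error, so one dropped session costs one task.
    """
    ftp = connect()
    failed = []
    try:
        for i, task in enumerate(tasks):
            if i > 0 and i % reconnect_every == 0:
                close_quietly(ftp)
                ftp = connect()
            try:
                fetch_one(ftp, task)
            except ftplib.all_errors:
                failed.append(task)
                close_quietly(ftp)
                ftp = connect()
    finally:
        close_quietly(ftp)
    return failed
```

The returned list could be retried in a second pass. Whether a fixed interval actually avoids the drops is something only testing against the server can confirm.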
I added a quick fix in this script. It simply invokes the same script a bunch of times with different random seeds. Every invocation downloads 10 genomes, and there are 51 invocations, so there should have been 510 genomes. Naturally, some genomes were repeated. In the end, after running this script, we have 472 genomes downloaded without any errors. I think that 10 genomes is a small enough batch that the server does not interrupt the FTP connection. This is not the cleanest solution, but at least now we have a large number of genomes to start experimenting with.
Yes, after running the experiment hundreds of times, I think the average crash occurs after ~40 genomes, so I agree that 10 should rarely crash. I think your approach is most likely faster. Let me know if we are definitely going with it, so that I know whether I should keep testing mine.
Command: get_reference_genomes.py -n 600 -s data -u
Script where error occurs: get_reference_genomes.py
Traceback:
File "../../scripts/get_reference_genomes.py", line 242, in <module>
main()
File "../../scripts/get_reference_genomes.py", line 194, in main
helper.go_to_direct()
File "../../scripts/get_reference_genomes.py", line 45, in go_to_direct
ftp.cwd(ftp.nlst()[0])
File "/home/grads/mbr5797/.conda/envs/KEGG_env/lib/python3.8/ftplib.py", line 621, in cwd
return self.voidcmd(cmd)
File "/home/grads/mbr5797/.conda/envs/KEGG_env/lib/python3.8/ftplib.py", line 282, in voidcmd
return self.voidresp()
File "/home/grads/mbr5797/.conda/envs/KEGG_env/lib/python3.8/ftplib.py", line 255, in voidresp
resp = self.getresp()
File "/home/grads/mbr5797/.conda/envs/KEGG_env/lib/python3.8/ftplib.py", line 250, in getresp
raise error_perm(resp)
ftplib.error_perm: 550 GCF_000022645.1_ASM2264v1_assembly_report.txt: No such file or directory
As @Omar-HeshamR reported verbally, this error is sporadic, and does not repeat deterministically.
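In the traceback, `ftp.cwd(ftp.nlst()[0])` was called with a name ending in `_assembly_report.txt`, which suggests that the first entry of the listing was a file rather than a directory on that run. A hedged sketch of a more defensive version follows. The helper name `enter_first_directory` is hypothetical, and whether "first entry that accepts `cwd`" is the directory the script actually wants is an assumption.

```python
import ftplib


def enter_first_directory(ftp):
    """cwd into the first listed entry that is a directory.

    Returns the entry name, or None if no entry accepts cwd.
    Hypothetical helper; `ftp` is assumed to be a connected ftplib.FTP.
    """
    for name in ftp.nlst():
        try:
            ftp.cwd(name)
            return name
        except ftplib.error_perm:
            # A 550 reply here means the entry is not a directory
            # (or is not accessible), so try the next one.
            continue
    return None
```

A caller could treat a `None` result as "skip this genome" instead of letting the whole run crash.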