Downloading genomes from GenBank using FTP runs into ftplib.error_perm #21

Open
mahmudhera opened this issue Aug 30, 2022 · 8 comments

@mahmudhera
Member

mahmudhera commented Aug 30, 2022

Command: get_reference_genomes.py -n 600 -s data -u

Script where error occurs: get_reference_genomes.py

Traceback:

File "../../scripts/get_reference_genomes.py", line 242, in <module>
main()
File "../../scripts/get_reference_genomes.py", line 194, in main
helper.go_to_direct()
File "../../scripts/get_reference_genomes.py", line 45, in go_to_direct
ftp.cwd(ftp.nlst()[0])
File "/home/grads/mbr5797/.conda/envs/KEGG_env/lib/python3.8/ftplib.py", line 621, in cwd
return self.voidcmd(cmd)
File "/home/grads/mbr5797/.conda/envs/KEGG_env/lib/python3.8/ftplib.py", line 282, in voidcmd
return self.voidresp()
File "/home/grads/mbr5797/.conda/envs/KEGG_env/lib/python3.8/ftplib.py", line 255, in voidresp
resp = self.getresp()
File "/home/grads/mbr5797/.conda/envs/KEGG_env/lib/python3.8/ftplib.py", line 250, in getresp
raise error_perm(resp)
ftplib.error_perm: 550 GCF_000022645.1_ASM2264v1_assembly_report.txt: No such file or directory

As @Omar-HeshamR reported verbally, this error is sporadic and does not reproduce deterministically.
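One possible reading of the traceback above: `ftp.cwd(ftp.nlst()[0])` assumes the first listing entry is a directory, while the 550 reply names an `_assembly_report.txt` file. A minimal sketch of a more defensive version follows, assuming that reading is right. The helper name `cwd_first_directory` is hypothetical and not part of the script.

```python
import ftplib


def cwd_first_directory(ftp):
    """Enter the first listing entry that accepts CWD and return its name.

    Hypothetical helper. Entries that are plain files make the server answer
    550, which ftplib raises as error_perm; those entries are skipped.
    """
    for name in ftp.nlst():
        try:
            ftp.cwd(name)
            return name
        except ftplib.error_perm:
            continue  # not a directory (or not accessible): try the next entry
    raise FileNotFoundError("no subdirectory found in the current FTP directory")
```

This does not explain why the listing order would vary between runs; it only avoids depending on that order.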

@mahmudhera
Member Author

Removing everything and then re-running, we have the following error:

Traceback (most recent call last):
File "../../scripts/get_reference_genomes.py", line 242, in <module>
main()
File "../../scripts/get_reference_genomes.py", line 204, in main
helper.download_FNA_file(path, current_directory_name)
File "../../scripts/get_reference_genomes.py", line 76, in download_FNA_file
ftp.retrbinary(f"RETR {filename}", file.write)
File "/home/grads/mbr5797/.conda/envs/KEGG_env/lib/python3.8/ftplib.py", line 441, in retrbinary
return self.voidresp()
File "/home/grads/mbr5797/.conda/envs/KEGG_env/lib/python3.8/ftplib.py", line 255, in voidresp
resp = self.getresp()
File "/home/grads/mbr5797/.conda/envs/KEGG_env/lib/python3.8/ftplib.py", line 240, in getresp
resp = self.getmultiline()
File "/home/grads/mbr5797/.conda/envs/KEGG_env/lib/python3.8/ftplib.py", line 226, in getmultiline
line = self.getline()
File "/home/grads/mbr5797/.conda/envs/KEGG_env/lib/python3.8/ftplib.py", line 214, in getline
raise EOFError
EOFError

Looks like it will be hard to track down why this is happening.

@mahmudhera
Member Author

Running the same command three more times results in the EOFError when downloading the same genome: Coprobacillus_sp._AF13-4LB.

Perhaps there is a pattern after all. Now trying to track down why this happens.

mahmudhera added a commit to mahmudhera/KEGG_sketching_annotation that referenced this issue Aug 30, 2022
@mahmudhera
Member Author

It looks like I was mistaken: the error is not for Coprobacillus_sp._AF13-4LB, which is downloaded correctly without any issues. The problem is with the next genome in the list, Streptomyces_sp._SID7817, whose genomic.fna file is not downloaded.

@Omar-HeshamR if you are investigating Coprobacillus_sp._AF13-4LB, you may want to skip that for now.

Probably the correct way to go about it is to investigate the error message itself first. The getline() function in ftplib.py documents that an EOFError occurs when the connection has been closed. I don't think there is much to do about a closed connection, other than simply skipping this genome.
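A minimal sketch of the "skip this genome" idea, under the assumption that the download is a single `retrbinary` call as in the traceback. The function name `try_download` and its parameters are hypothetical; only the `ftplib` calls are real.

```python
import ftplib
import os


def try_download(ftp, filename, local_path):
    """Download one file; return True on success, False if the connection dropped.

    Hypothetical helper. On failure the partial local file is removed so a
    truncated genomic.fna is not mistaken for a complete one.
    """
    try:
        with open(local_path, "wb") as fh:
            ftp.retrbinary(f"RETR {filename}", fh.write)
        return True
    except (EOFError, ConnectionError, ftplib.error_temp):
        # EOFError: server closed the control connection;
        # error_temp: 4xx replies such as a server-side timeout.
        if os.path.exists(local_path):
            os.remove(local_path)
        return False
```

After a `False` return the same `ftp` object is probably unusable, so the caller would need a fresh connection before moving on to the next genome.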

@Omar-HeshamR
Collaborator

I tested with Streptomyces_sp._SID7817 and it worked completely fine using the same code, so again I think the problem is independent of the genome itself and has to do with the FTP connection. Still investigating.

@mahmudhera
Member Author

mahmudhera commented Aug 31, 2022

I think you are correct. I am leaving this for tonight and will look into it again tomorrow, but I guess we were too optimistic in assuming that the FTP connection will stay alive for 500/1000 genomes. Probably the connection just closes itself after a number of downloads. We could try to reestablish the connection periodically, or use multiple threads, each handling a limited number of genomes. I’m still not sure which would be the best way to go about it. Using multiple threads may make it faster, but it also introduces the added complexity of dividing the work and coordinating among the threads.
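The periodic-reconnect option could look roughly like the sketch below. Everything named here (`download_with_reconnect`, the `connect` callable, the job tuples, the default of 10) is an assumption for illustration, not the script's actual structure. `remote_dir` is assumed to be an absolute path so that `cwd` still works after a reconnect.

```python
import ftplib
import os


def _close_quietly(ftp):
    """Say goodbye politely if possible, otherwise just drop the socket."""
    try:
        ftp.quit()
    except Exception:
        ftp.close()


def download_with_reconnect(connect, jobs, reconnect_every=10, max_attempts=3):
    """Download jobs, opening a fresh connection every `reconnect_every` files.

    connect: zero-argument callable returning a logged-in ftplib.FTP (hypothetical).
    jobs: iterable of (remote_dir, filename, local_path) tuples (hypothetical layout).
    Returns the list of jobs that still failed after `max_attempts` tries.
    """
    failed = []
    ftp = connect()
    since_connect = 0
    for remote_dir, filename, local_path in jobs:
        if since_connect >= reconnect_every:
            _close_quietly(ftp)
            ftp = connect()
            since_connect = 0
        for _ in range(max_attempts):
            try:
                ftp.cwd(remote_dir)
                with open(local_path, "wb") as fh:  # "wb" truncates any partial file
                    ftp.retrbinary(f"RETR {filename}", fh.write)
                break
            except (EOFError, ConnectionError, ftplib.error_temp):
                # Connection dropped mid-job: reconnect and retry this job.
                _close_quietly(ftp)
                ftp = connect()
                since_connect = 0
        else:
            # All attempts failed: remove the partial file and record the job.
            if os.path.exists(local_path):
                os.remove(local_path)
            failed.append((remote_dir, filename, local_path))
        since_connect += 1
    _close_quietly(ftp)
    return failed
```

Compared with threads, this keeps a single sequential loop, so there is nothing to coordinate; the cost is one extra login per batch of files.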

@Omar-HeshamR
Collaborator

Yes, I am going to start by resetting the connection periodically to see whether that addresses the root cause of the problem. Then I will look into using multiple threads to make it faster.

mahmudhera added a commit to mahmudhera/KEGG_sketching_annotation that referenced this issue Aug 31, 2022
mahmudhera added a commit to mahmudhera/KEGG_sketching_annotation that referenced this issue Aug 31, 2022
@mahmudhera
Member Author

mahmudhera commented Aug 31, 2022

I added a quick fix in this script. It just invokes the same script a bunch of times with different random seeds. Every invocation downloads 10 genomes, and there are 51 invocations, so there should have been 510 genomes. Naturally, some genomes were repeated. In the end, after running this script, we have 472 genomes downloaded without any errors. I think that 10 genomes per invocation is small enough that the server did not interrupt the FTP connection.

This is not the cleanest solution, but at least now we have a large number of genomes to start experimenting with.
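The idea of many small batches with distinct seeds can be sketched in-process as follows. This is not the wrapper script referred to above; `run_batches` and the `download_batch(seed, batch_size)` callable are hypothetical stand-ins for one short-lived invocation of the downloader.

```python
import ftplib


def run_batches(download_batch, n_batches=51, batch_size=10, base_seed=0):
    """Run many small download batches, each with its own seed.

    download_batch: hypothetical callable taking (seed, batch_size) and
    returning the names of the genomes it downloaded.
    Returns (set of unique genome names, list of seeds whose batch failed).
    """
    unique = set()
    failed_seeds = []
    for i in range(n_batches):
        seed = base_seed + i
        try:
            # A set absorbs genomes that several random batches happen to share.
            unique.update(download_batch(seed, batch_size))
        except ftplib.all_errors:
            # all_errors covers ftplib errors, OSError and EOFError.
            failed_seeds.append(seed)
    return unique, failed_seeds
```

Because batches are drawn independently, the number of unique genomes will usually be below `n_batches * batch_size`, which matches the gap between 510 requested and 472 obtained.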

@Omar-HeshamR
Collaborator

Yes, after running the experiment hundreds of times, I think the crash happens after ~40 genomes on average, so I agree that batches of 10 should rarely crash. I think your approach is most likely faster. Let me know if we are definitely going with it, so that I know whether I should keep testing mine.
