pyani download is blocked if downloaded file cannot be uncompressed. #383
Comments
I think this is different to #70
I believe this is also causing tests to fail, specifically these:
tests/test_subcmd_01_download.py::test_download_dry_run FAILED [ 70%]
tests/test_subcmd_01_download.py::test_download_c_blochmannia FAILED [ 71%]
tests/test_subcmd_01_download.py::test_download_kraken FAILED [ 72%]
dryrun_namespace = Namespace(api_keypath=PosixPath('~/.ncbi/api_key'), batchsize=10000, classfname='classes.txt', disable_tqdm=True, dryr.../tmp/pytest-of-baileythegreen/pytest-17/test_download_dry_run0/C_blochmannia'), retries=20, taxon='203804', timeout=10)
def test_download_dry_run(dryrun_namespace):
"""Dry run of C. blochmannia download."""
> subcommands.subcmd_download(dryrun_namespace)
tests/test_subcmd_01_download.py:128:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pyani/scripts/subcommands/subcmd_download.py:372: in subcmd_download
classes, labels, skippedlist = download_data(args, api_key, asm_dict)
pyani/scripts/subcommands/subcmd_download.py:124: in download_data
esummary, filestem = download.get_ncbi_esummary(
pyani/download.py:355: in get_ncbi_esummary
summary = entrez_esummary(
pyani/download.py:237: in wrapper
return Entrez.read(output, validate=False)
../miniconda3/lib/python3.8/site-packages/Bio/Entrez/__init__.py:508: in read
record = handler.read(handle)
../miniconda3/lib/python3.8/site-packages/Bio/Entrez/Parser.py:335: in read
self.parser.ParseFile(handle)
/opt/concourse/worker/volumes/live/b884be86-9a72-40c1-600c-116a7b9e8bbe/volume/python_1621446997202/work/Modules/pyexpat.c:407: in StartElement
???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <Bio.Entrez.Parser.DataHandler object at 0x127a6b280>, tag = 'eSummaryResult', attrs = {}
def handleMissingDocumentDefinition(self, tag, attrs):
"""Raise an Exception if neither a DTD nor an XML Schema is found."""
> raise ValueError(
"As the XML data contained neither a Document Type Definition (DTD) nor an XML Schema, Bio.Entrez is unable to parse these data. We recommend using a generic XML parser from the Python standard library instead, for example ElementTree."
)
E ValueError: As the XML data contained neither a Document Type Definition (DTD) nor an XML Schema, Bio.Entrez is unable to parse these data. We recommend using a generic XML parser from the Python standard library instead, for example ElementTree.
../miniconda3/lib/python3.8/site-packages/Bio/Entrez/Parser.py:448: ValueError
--------------------------------------------------------- Captured stderr call ----------------------------------------------------------
[WARNING] [pyani.scripts.subcommands.subcmd_download]: Dry run only: will not overwrite or download
[WARNING] [pyani.scripts.subcommands.subcmd_download]: Dry run only: will not overwrite or download
[WARNING] [pyani.scripts.subcommands.subcmd_download]: Dry run only: will not overwrite or download
[WARNING] [pyani.scripts.subcommands.subcmd_download]: Dry run only: will not overwrite or download
[WARNING] [pyani.scripts.subcommands.subcmd_download]: Dry run only: will not overwrite or download
[WARNING] [pyani.scripts.subcommands.subcmd_download]: Dry run only: will not overwrite or download
[WARNING] [pyani.scripts.subcommands.subcmd_download]: Dry run only: will not overwrite or download
[WARNING] [pyani.scripts.subcommands.subcmd_download]: Dry run only: will not overwrite or download
[WARNING] [pyani.scripts.subcommands.subcmd_download]: API path /Users/baileythegreen/.ncbi/api_key not a valid file. Not using API key.
[WARNING] [pyani.scripts.subcommands.subcmd_download]: API path /Users/baileythegreen/.ncbi/api_key not a valid file. Not using API key.
[WARNING] [pyani.scripts.subcommands.subcmd_download]: API path /Users/baileythegreen/.ncbi/api_key not a valid file. Not using API key.
[WARNING] [pyani.scripts.subcommands.subcmd_download]: API path /Users/baileythegreen/.ncbi/api_key not a valid file. Not using API key.
[WARNING] [pyani.scripts.subcommands.subcmd_download]: API path /Users/baileythegreen/.ncbi/api_key not a valid file. Not using API key.
[WARNING] [pyani.scripts.subcommands.subcmd_download]: API path /Users/baileythegreen/.ncbi/api_key not a valid file. Not using API key.
[WARNING] [pyani.scripts.subcommands.subcmd_download]: API path /Users/baileythegreen/.ncbi/api_key not a valid file. Not using API key.
[WARNING] [pyani.scripts.subcommands.subcmd_download]: API path /Users/baileythegreen/.ncbi/api_key not a valid file. Not using API key.
----------------------------------------------------------- Captured log call -----------------------------------------------------------
INFO pyani.scripts.subcommands.subcmd_download:subcmd_download.py:356 Downloading genomes from NCBI
WARNING pyani.scripts.subcommands.subcmd_download:subcmd_download.py:360 Dry run only: will not overwrite or download
INFO pyani.scripts.subcommands.subcmd_download:subcmd_download.py:76 Setting Entrez email address: my.email@my.domain
WARNING pyani.scripts.subcommands.subcmd_download:subcmd_download.py:339 API path /Users/baileythegreen/.ncbi/api_key not a valid file. Not using API key.
INFO pyani.scripts.subcommands.subcmd_download:subcmd_download.py:316 Taxon IDs received: ['203804']
DEBUG pyani.scripts.subcommands.subcmd_download:subcmd_download.py:319 Taxon ID summary
Query: 203804
asm count: 9
UIDs: ['8228891', '5431901', '522068', '444958', '322791', '322771', '275848', '61868', '32848']
INFO pyani.scripts.subcommands.subcmd_download:subcmd_download.py:117 Downloading contigs for Taxon ID ['8228891', '5431901', '522068', '444958', '322791', '322771', '275848', '61868', '32848']
INFO pyani.scripts.subcommands.subcmd_download:subcmd_download.py:120 Retrieving eSummary information for UID 8228891
______________________________________________________ test_download_c_blochmannia ______________________________________________________
base_download_namespace = Namespace(api_keypath=PosixPath('~/.ncbi/api_key'), batchsize=10000, classfname='classes.txt', disable_tqdm=True, dryr...ytest-of-baileythegreen/pytest-17/test_download_c_blochmannia0/C_blochmannia'), retries=20, taxon='203804', timeout=10)
def test_download_c_blochmannia(base_download_namespace):
"""Test C. blochmannia download."""
> subcommands.subcmd_download(base_download_namespace)
tests/test_subcmd_01_download.py:133:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pyani/scripts/subcommands/subcmd_download.py:372: in subcmd_download
classes, labels, skippedlist = download_data(args, api_key, asm_dict)
pyani/scripts/subcommands/subcmd_download.py:124: in download_data
esummary, filestem = download.get_ncbi_esummary(
pyani/download.py:355: in get_ncbi_esummary
summary = entrez_esummary(
pyani/download.py:237: in wrapper
return Entrez.read(output, validate=False)
../miniconda3/lib/python3.8/site-packages/Bio/Entrez/__init__.py:508: in read
record = handler.read(handle)
../miniconda3/lib/python3.8/site-packages/Bio/Entrez/Parser.py:335: in read
self.parser.ParseFile(handle)
/opt/concourse/worker/volumes/live/b884be86-9a72-40c1-600c-116a7b9e8bbe/volume/python_1621446997202/work/Modules/pyexpat.c:407: in StartElement
???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <Bio.Entrez.Parser.DataHandler object at 0x1275e7160>, tag = 'eSummaryResult', attrs = {}
def handleMissingDocumentDefinition(self, tag, attrs):
"""Raise an Exception if neither a DTD nor an XML Schema is found."""
> raise ValueError(
"As the XML data contained neither a Document Type Definition (DTD) nor an XML Schema, Bio.Entrez is unable to parse these data. We recommend using a generic XML parser from the Python standard library instead, for example ElementTree."
)
E ValueError: As the XML data contained neither a Document Type Definition (DTD) nor an XML Schema, Bio.Entrez is unable to parse these data. We recommend using a generic XML parser from the Python standard library instead, for example ElementTree.
../miniconda3/lib/python3.8/site-packages/Bio/Entrez/Parser.py:448: ValueError
--------------------------------------------------------- Captured stderr call ----------------------------------------------------------
[WARNING] [pyani.scripts]: Output directory overwrite forced
[WARNING] [pyani.scripts]: Output directory overwrite forced
[WARNING] [pyani.scripts]: Output directory overwrite forced
[WARNING] [pyani.scripts]: Output directory overwrite forced
[WARNING] [pyani.scripts]: Output directory overwrite forced
[WARNING] [pyani.scripts]: Output directory overwrite forced
[WARNING] [pyani.scripts]: Output directory overwrite forced
[WARNING] [pyani.scripts]: Output directory overwrite forced
[WARNING] [pyani.scripts.subcommands.subcmd_download]: API path /Users/baileythegreen/.ncbi/api_key not a valid file. Not using API key.
[WARNING] [pyani.scripts.subcommands.subcmd_download]: API path /Users/baileythegreen/.ncbi/api_key not a valid file. Not using API key.
[WARNING] [pyani.scripts.subcommands.subcmd_download]: API path /Users/baileythegreen/.ncbi/api_key not a valid file. Not using API key.
[WARNING] [pyani.scripts.subcommands.subcmd_download]: API path /Users/baileythegreen/.ncbi/api_key not a valid file. Not using API key.
[WARNING] [pyani.scripts.subcommands.subcmd_download]: API path /Users/baileythegreen/.ncbi/api_key not a valid file. Not using API key.
[WARNING] [pyani.scripts.subcommands.subcmd_download]: API path /Users/baileythegreen/.ncbi/api_key not a valid file. Not using API key.
[WARNING] [pyani.scripts.subcommands.subcmd_download]: API path /Users/baileythegreen/.ncbi/api_key not a valid file. Not using API key.
[WARNING] [pyani.scripts.subcommands.subcmd_download]: API path /Users/baileythegreen/.ncbi/api_key not a valid file. Not using API key.
----------------------------------------------------------- Captured log call -----------------------------------------------------------
INFO pyani.scripts.subcommands.subcmd_download:subcmd_download.py:356 Downloading genomes from NCBI
INFO pyani.scripts:__init__.py:39 Creating output directory /private/tmp/pytest-of-baileythegreen/pytest-17/test_download_c_blochmannia0/C_blochmannia
WARNING pyani.scripts:__init__.py:42 Output directory overwrite forced
INFO pyani.scripts.subcommands.subcmd_download:subcmd_download.py:76 Setting Entrez email address: my.email@my.domain
WARNING pyani.scripts.subcommands.subcmd_download:subcmd_download.py:339 API path /Users/baileythegreen/.ncbi/api_key not a valid file. Not using API key.
INFO pyani.scripts.subcommands.subcmd_download:subcmd_download.py:316 Taxon IDs received: ['203804']
DEBUG pyani.scripts.subcommands.subcmd_download:subcmd_download.py:319 Taxon ID summary
Query: 203804
asm count: 9
UIDs: ['8228891', '5431901', '522068', '444958', '322791', '322771', '275848', '61868', '32848']
INFO pyani.scripts.subcommands.subcmd_download:subcmd_download.py:117 Downloading contigs for Taxon ID ['8228891', '5431901', '522068', '444958', '322791', '322771', '275848', '61868', '32848']
INFO pyani.scripts.subcommands.subcmd_download:subcmd_download.py:120 Retrieving eSummary information for UID 8228891
DEBUG pyani.scripts.subcommands.subcmd_download:subcmd_download.py:139 eSummary information (GCF_014857065.1_ASM1485706v1):
Species Taxid: 2681987
TaxID: 2681987
Accession: GCF_014857065.1
Name: ASM1485706v1
Organism: Blochmannia endosymbiont of Colobopsis nipponica
Genus: Blochmannia
Species: endosymbiont of Colobopsis nipponica
Strain:
INFO pyani.scripts.subcommands.subcmd_download:subcmd_download.py:239 Retrieving URLs for GCF_014857065.1_ASM1485706v1
DEBUG pyani.scripts.subcommands.subcmd_download:subcmd_download.py:292 Downloaded from URL: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/014/857/065/GCF_014857065.1_ASM1485706v1/GCF_014857065.1_ASM1485706v1_genomic.fna.gz
DEBUG pyani.scripts.subcommands.subcmd_download:subcmd_download.py:293 Wrote assembly to: /private/tmp/pytest-of-baileythegreen/pytest-17/test_download_c_blochmannia0/C_blochmannia/GCF_014857065.1_ASM1485706v1_genomic.fna.gz
DEBUG pyani.scripts.subcommands.subcmd_download:subcmd_download.py:294 Wrote MD5 hashes to: /private/tmp/pytest-of-baileythegreen/pytest-17/test_download_c_blochmannia0/C_blochmannia/GCF_014857065.1_ASM1485706v1_hashes.txt
DEBUG pyani.scripts.subcommands.subcmd_download:subcmd_download.py:298 Local MD5 hash: fbd87dfdbb889fad197db147c90790f8
DEBUG pyani.scripts.subcommands.subcmd_download:subcmd_download.py:299 NCBI MD5 hash: fbd87dfdbb889fad197db147c90790f8
INFO pyani.scripts.subcommands.subcmd_download:subcmd_download.py:301 MD5 hash check passed
DEBUG pyani.scripts.subcommands.subcmd_download:subcmd_download.py:184 Extracting archive /private/tmp/pytest-of-baileythegreen/pytest-17/test_download_c_blochmannia0/C_blochmannia/GCF_014857065.1_ASM1485706v1_genomic.fna.gz to /private/tmp/pytest-of-baileythegreen/pytest-17/test_download_c_blochmannia0/C_blochmannia/GCF_014857065.1_ASM1485706v1_genomic.fna
DEBUG pyani.scripts.subcommands.subcmd_download:subcmd_download.py:211 Creating local MD5 hash for /private/tmp/pytest-of-baileythegreen/pytest-17/test_download_c_blochmannia0/C_blochmannia/GCF_014857065.1_ASM1485706v1_genomic.fna
DEBUG pyani.scripts.subcommands.subcmd_download:subcmd_download.py:214 Writing hash to /private/tmp/pytest-of-baileythegreen/pytest-17/test_download_c_blochmannia0/C_blochmannia/GCF_014857065.1_ASM1485706v1_genomic.md5
INFO pyani.scripts.subcommands.subcmd_download:subcmd_download.py:161 Label and class file entries
Label: fb08eedc0cf49e1cf44a95539ae4fd7c GCF_014857065.1_ASM1485706v1_genomic B. endosymbiont of Colobopsis nipponica
Class: fb08eedc0cf49e1cf44a95539ae4fd7c GCF_014857065.1_ASM1485706v1_genomic Blochmannia endosymbiont of Colobopsis nipponica
INFO pyani.scripts.subcommands.subcmd_download:subcmd_download.py:120 Retrieving eSummary information for UID 5431901
_________________________________________________________ test_download_kraken __________________________________________________________
kraken_namespace = Namespace(api_keypath=PosixPath('~/.ncbi/api_key'), batchsize=10000, classfname='classes.txt', disable_tqdm=True, dryr.../private/tmp/pytest-of-baileythegreen/pytest-17/test_download_kraken0/kraken'), retries=20, taxon='203804', timeout=10)
def test_download_kraken(kraken_namespace):
"""C. blochmannia download in Kraken format."""
> subcommands.subcmd_download(kraken_namespace)
tests/test_subcmd_01_download.py:138:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pyani/scripts/subcommands/subcmd_download.py:372: in subcmd_download
classes, labels, skippedlist = download_data(args, api_key, asm_dict)
pyani/scripts/subcommands/subcmd_download.py:124: in download_data
esummary, filestem = download.get_ncbi_esummary(
pyani/download.py:355: in get_ncbi_esummary
summary = entrez_esummary(
pyani/download.py:237: in wrapper
return Entrez.read(output, validate=False)
../miniconda3/lib/python3.8/site-packages/Bio/Entrez/__init__.py:508: in read
record = handler.read(handle)
../miniconda3/lib/python3.8/site-packages/Bio/Entrez/Parser.py:335: in read
self.parser.ParseFile(handle)
/opt/concourse/worker/volumes/live/b884be86-9a72-40c1-600c-116a7b9e8bbe/volume/python_1621446997202/work/Modules/pyexpat.c:407: in StartElement
???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <Bio.Entrez.Parser.DataHandler object at 0x1039aeb80>, tag = 'eSummaryResult', attrs = {}
def handleMissingDocumentDefinition(self, tag, attrs):
"""Raise an Exception if neither a DTD nor an XML Schema is found."""
> raise ValueError(
"As the XML data contained neither a Document Type Definition (DTD) nor an XML Schema, Bio.Entrez is unable to parse these data. We recommend using a generic XML parser from the Python standard library instead, for example ElementTree."
)
E ValueError: As the XML data contained neither a Document Type Definition (DTD) nor an XML Schema, Bio.Entrez is unable to parse these data. We recommend using a generic XML parser from the Python standard library instead, for example ElementTree.
../miniconda3/lib/python3.8/site-packages/Bio/Entrez/Parser.py:448: ValueError
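The ValueError in the traceback above suggests a generic XML parser such as ElementTree for responses that carry neither a DTD nor a schema. A minimal sketch of that suggestion follows, useful mainly for inspecting what such a response actually contains. The sample XML, its element names, and the helper `summarise_esummary` are illustrative assumptions, not an actual NCBI response or part of pyani.

```python
import xml.etree.ElementTree as ET

# Illustrative stand-in for an eSummary response without a DTD.
# The element names here are assumptions made for this sketch.
SAMPLE_XML = """<eSummaryResult>
  <DocumentSummarySet status="OK">
    <DocumentSummary uid="8228891">
      <AssemblyAccession>GCF_014857065.1</AssemblyAccession>
      <AssemblyName>ASM1485706v1</AssemblyName>
    </DocumentSummary>
  </DocumentSummarySet>
</eSummaryResult>"""


def summarise_esummary(xml_text):
    """Parse DTD-less XML and collect any errors and document summaries."""
    root = ET.fromstring(xml_text)
    # Collect the text of any ERROR elements, wherever they appear.
    errors = [el.text for el in root.iter("ERROR")]
    # Map each document's uid attribute to a dict of its direct children.
    documents = {
        doc.get("uid"): {child.tag: child.text for child in doc}
        for doc in root.iter("DocumentSummary")
    }
    return {"root": root.tag, "errors": errors, "documents": documents}


summary = summarise_esummary(SAMPLE_XML)
print(summary["documents"]["8228891"]["AssemblyAccession"])
```

Dumping the root tag and any error text this way may help distinguish a malformed or error response from a genuine parsing problem in Bio.Entrez.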
I think we need to investigate why these tests are now failing due to the uncompression, when they were previously working. Is there something about the download that has changed?
If you are able to reproduce the failures, you are welcome to try. I no longer seem to be able to. I did find this GitHub issue, part of which seemed to indicate something like this could be caused by a temporary issue, but I can't say if that's what happened here. The traceback I copied above is the only example I have of those tests failing locally.
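If the parse failure really is temporary, one way to tolerate it is to retry the call a bounded number of times when it raises ValueError. The sketch below assumes that is acceptable for the caller; `retry_on` and `flaky_fetch` are hypothetical names, and `flaky_fetch` is a stand-in that fails twice before succeeding rather than a real Entrez request. It is not how pyani's own retry wrapper is written.

```python
import time
from functools import wraps


def retry_on(exc_types, retries=3, delay=0.0):
    """Retry the wrapped function when it raises one of exc_types."""

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            last_exc = None
            for attempt in range(1, retries + 1):
                try:
                    return func(*args, **kwargs)
                except exc_types as exc:
                    last_exc = exc
                    if attempt < retries:
                        # Wait a little longer after each failed attempt.
                        time.sleep(delay * attempt)
            # Every attempt failed: surface the last error to the caller.
            raise last_exc

        return wrapper

    return decorator


calls = {"count": 0}


@retry_on((ValueError,), retries=3)
def flaky_fetch():
    """Stand-in for a remote call that fails on its first two attempts."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ValueError("response contained neither a DTD nor a schema")
    return {"uid": "8228891"}


result = flaky_fetch()
print(result, "after", calls["count"], "attempts")
```

Re-raising the last exception once the retries are exhausted keeps a persistent fault visible instead of hiding it.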
The DTD file issue is different (I've encountered it before). With the issue I originally raised, several
On testing the above command again today (2022-03-15) the downloads proceed without error. I'm calling this a transitory issue, possibly a fault at NCBI's end, and closing the issue.
Summary:
pyani downloads are blocked if a downloaded file cannot be uncompressed.
Description:
Using pyani download sometimes recovers corrupt compressed files from NCBI. If these throw an error with gunzip, the whole download halts. What should happen is that the error is noted, and pyani continues with the remaining downloads.
Reproducible Steps:
Three attempts, same error:
Current Output:
Expected Output:
The equivalent of the below, for the downloaded genome:
pyani Version:
v0.3-alpha
Python Version:
3.9
Operating System:
macOS 12.2.1
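The behaviour requested in the Description (note the error, then continue with the remaining downloads) can be sketched as follows. This is a minimal illustration using the standard-library gzip module, not pyani's implementation; `extract_all` and its return shape are hypothetical, and `Path.unlink(missing_ok=True)` assumes Python 3.8 or later.

```python
import gzip
import shutil
import zlib
from pathlib import Path


def extract_all(archives):
    """Extract each .gz archive, skipping any that cannot be decompressed.

    Returns (extracted, skipped): a list of output paths, and a list of
    (archive, error message) pairs for the archives that failed.
    """
    extracted, skipped = [], []
    for archive in map(Path, archives):
        target = archive.with_suffix("")  # strip the trailing ".gz"
        try:
            with gzip.open(archive, "rb") as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
        except (OSError, EOFError, zlib.error) as exc:
            # OSError covers gzip.BadGzipFile; EOFError covers truncation.
            # Remove any partial output, record the failure, and move on.
            target.unlink(missing_ok=True)
            skipped.append((archive, str(exc)))
            continue
        extracted.append(target)
    return extracted, skipped
```

A caller could log each entry in `skipped` as a warning and report the list at the end of the run, so a single corrupt archive no longer halts the whole download.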