Unexpected behavior when downloading fastq using SRA identifier #34

Open
jolespin opened this issue Dec 13, 2023 · 3 comments

@jolespin
Contributor

jolespin commented Dec 13, 2023

https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&page_size=10&acc=SRR13615821&display=metadata

I ran kingfisher and it pulled 3 FASTQ files for 1 record: one single-ended file and two paired-end files.

(base) [jespinoz@exp-15-28 split_reads]$ kingfisher --version
0.3.1

ID=SRR13615821
kingfisher get -r ${ID} -m aws-http -f fastq.gz

I thought one of the files might be interleaved, but the read counts didn't match up:

(base) [jespinoz@exp-15-28 Fastq]$ seqkit stats SRR13615821_1.fastq.gz SRR13615821_2.fastq.gz split_reads/SRR13615821.fastq.gz
processed files:  3 / 3 [======================================] ETA: 0s. done
file                              format  type   num_seqs        sum_len  min_len  avg_len  max_len
SRR13615821_1.fastq.gz            FASTQ   DNA     808,228    197,172,014       35      244      301
SRR13615821_2.fastq.gz            FASTQ   DNA     808,228    199,461,172       21    246.8      301
split_reads/SRR13615821.fastq.gz  FASTQ   DNA   5,860,790  1,438,979,322       35    245.5      301

These are the files that kingfisher downloaded.
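As a quick sanity check of the interleaving idea, here is a minimal sketch that uses only the `num_seqs` values from the seqkit table. The variable names are made up for illustration. If the third file were an interleaved copy of `_1` and `_2`, it should hold exactly as many records as the two paired files combined.

```python
# Read counts copied from the seqkit stats output in this issue.
paired_r1 = 808_228   # SRR13615821_1.fastq.gz
paired_r2 = 808_228   # SRR13615821_2.fastq.gz
third_file = 5_860_790  # SRR13615821.fastq.gz

# An interleaved copy of the paired files would contain R1 + R2 records.
expected_if_interleaved = paired_r1 + paired_r2

print(f"expected if interleaved: {expected_if_interleaved:,}")
print(f"observed in third file:  {third_file:,}")
print("consistent with interleaving:", third_file == expected_if_interleaved)
```

The third file is several times larger than an interleaved copy would be, so it has to contain reads that are not in the paired files.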

Note: I moved SRR13615821.fastq.gz into a separate folder to split the reads, but repair.sh from the BBMap suite reported no pairs:

(base) [jespinoz@exp-15-28 split_reads]$ repair.sh in=SRR13615821.fastq.gz out1=SRR13615821_1.fastq.gz out2=SRR13615821_2.fastq.gz
java -ea -Xmx84979m -cp /expanse/projects/jcl110/miniconda3/opt/bbmap-39.01-1/current/ jgi.SplitPairsAndSingles rp in=SRR13615821.fastq.gz out1=SRR13615821_1.fastq.gz out2=SRR13615821_2.fastq.gz
Executing jgi.SplitPairsAndSingles [rp, in=SRR13615821.fastq.gz, out1=SRR13615821_1.fastq.gz, out2=SRR13615821_2.fastq.gz]

Set INTERLEAVED to false
Started output stream.

Input:                  	5860790 reads 		1438979322 bases.
Result:                 	5860790 reads (100.00%) 	1438979322 bases (100.00%)
Pairs:                  	0 reads (0.00%) 	0 bases (0.00%)
Singletons:             	5860790 reads (100.00%) 	1438979322 bases (100.00%)

Time:                         	36.897 seconds.
Reads Processed:       5860k 	158.84k reads/sec
Bases Processed:       1438m 	39.00m bases/sec

That was my attempt to split the reads manually.
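To illustrate what "0 pairs, all singletons" means, here is a rough sketch that judges pairing purely by read name. This is not how BBTools implements `repair.sh`. The function names and the toy records are hypothetical, and it assumes plain 4-line FASTQ records with mates that share a name (optionally suffixed `/1` and `/2`).

```python
from collections import Counter


def read_names(fastq_text):
    """Yield one normalized read name per 4-line FASTQ record."""
    lines = fastq_text.splitlines()
    for i in range(0, len(lines) - 3, 4):
        # Header looks like "@name optional description".
        name = lines[i][1:].split()[0]
        # Treat "name/1" and "name/2" as mates of the same fragment.
        if name.endswith(("/1", "/2")):
            name = name[:-2]
        yield name


def pair_summary(fastq_text):
    """Return (pairs, singletons) judged purely by read name."""
    counts = Counter(read_names(fastq_text))
    pairs = sum(1 for n in counts.values() if n == 2)
    singletons = sum(1 for n in counts.values() if n == 1)
    return pairs, singletons


# Hypothetical toy records, not taken from SRR13615821.
interleaved = "@r1/1\nACGT\n+\nIIII\n@r1/2\nTTGA\n+\nIIII\n"
orphans = "@r1 1\nACGT\n+\nIIII\n@r2 2\nGGCA\n+\nIIII\n"

print(pair_summary(interleaved))  # (1, 0): one name seen twice
print(pair_summary(orphans))      # (0, 2): every name seen once
```

A file in which every read name occurs exactly once behaves like the second case, which matches the all-singletons report above.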

Do you know what could be happening?

@jolespin
Contributor Author

I tried downloading using a separate command:

(base) [jespinoz@exp-15-28 tmp]$ kingfisher get -r SRR13615821 -m ena-ascp aws-http prefetch
12/13/2023 11:39:51 AM INFO: Kingfisher v0.3.1
12/13/2023 11:39:51 AM INFO: Attempting download method ena-ascp for run SRR13615821 ..
12/13/2023 11:39:51 AM INFO: Using aspera ssh key file: /expanse/projects/jcl110/miniconda3/lib/python3.11/site-packages/kingfisher/data/asperaweb_id_dsa.openssh
12/13/2023 11:39:51 AM INFO: Querying ENA for FTP paths for SRR13615821..
12/13/2023 11:39:52 AM INFO: Downloading 3 FTP read set(s): ftp.sra.ebi.ac.uk/vol1/fastq/SRR136/021/SRR13615821/SRR13615821.fastq.gz, ftp.sra.ebi.ac.uk/vol1/fastq/SRR136/021/SRR13615821/SRR13615821_1.fastq.gz, ftp.sra.ebi.ac.uk/vol1/fastq/SRR136/021/SRR13615821/SRR13615821_2.fastq.gz
12/13/2023 11:39:52 AM INFO: Running command: ascp -T -l 300m -P33001 -k 2 -i /expanse/projects/jcl110/miniconda3/lib/python3.11/site-packages/kingfisher/data/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:/vol1/fastq/SRR136/021/SRR13615821/SRR13615821.fastq.gz .
12/13/2023 11:39:52 AM WARNING: Error downloading from ENA with ASCP: Command ascp -T -l 300m -P33001 -k 2 -i /expanse/projects/jcl110/miniconda3/lib/python3.11/site-packages/kingfisher/data/asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:/vol1/fastq/SRR136/021/SRR13615821/SRR13615821.fastq.gz . returned non-zero exit status 127.
STDERR was: b'bash: ascp: command not found\n'STDOUT was: b''
12/13/2023 11:39:52 AM WARNING: Method ena-ascp failed
12/13/2023 11:39:52 AM INFO: Attempting download method aws-http for run SRR13615821 ..
12/13/2023 11:39:53 AM INFO: Found ODP link https://sra-pub-run-odp.s3.amazonaws.com/sra/SRR13615821/SRR13615821
12/13/2023 11:39:53 AM INFO: Downloading .SRA file from AWS Open Data Program HTTP link using aria2c ..

12/13 11:39:53 [NOTICE] Downloading 1 item(s)

12/13 11:39:54 [NOTICE] Allocating disk space. Use --file-allocation=none to disable it. See --file-allocation option in man page for more details.
[#2f4efe 831MiB/852MiB(97%) CN:1 DL:104MiB]
12/13 11:40:05 [NOTICE] Download complete: /expanse/projects/jcl110/VEBA_v2_CaseStudies/Kolyma_Permafrost/Fastq/tmp/SRR13615821.sra

Download Results:
gid   |stat|avg speed  |path/URI
======+====+===========+=======================================================
2f4efe|OK  |    89MiB/s|/expanse/projects/jcl110/VEBA_v2_CaseStudies/Kolyma_Permafrost/Fastq/tmp/SRR13615821.sra

Status Legend:
(OK):download completed.
12/13/2023 11:40:05 AM INFO: Download finished, validating ..
12/13/2023 11:40:05 AM INFO: Method aws-http worked.
12/13/2023 11:40:05 AM INFO: Extracting .sra file with fasterq-dump ..
12/13/2023 11:40:46 AM INFO: Output files: SRR13615821_1.fastq, SRR13615821_2.fastq, SRR13615821.fastq
12/13/2023 11:40:46 AM INFO: Kingfisher done.

@wwood
Owner

wwood commented Dec 13, 2023

Sometimes this happens when people upload reads that have been QC'd (so some pairs become single-ended reads), I think.

I don't think there is any issue with kingfisher - looks like it is just the NCBI webpage being misleading? EBI has 3 files too:
https://www.ebi.ac.uk/ena/browser/view/SRR13615821
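A toy sketch of how QC'd uploads can end up as three files: spots that still have both mates go to `_1`/`_2`, and spots that kept only one mate go to a third, unpaired file. As far as I understand, this mirrors the three-way split that fasterq-dump performs by default, but the code below is only an illustration with hypothetical names and data, not the SRA Toolkit's implementation.

```python
def split_three(spots):
    """Partition spots into (_1, _2, unpaired) lists.

    Each spot is a (name, forward, reverse) tuple. An empty string
    stands for a mate that was removed before upload (an assumption
    made for this illustration).
    """
    r1, r2, unpaired = [], [], []
    for name, fwd, rev in spots:
        if fwd and rev:
            # Both mates survived: keep them in sync across _1 and _2.
            r1.append((name, fwd))
            r2.append((name, rev))
        elif fwd or rev:
            # Only one mate survived: it goes to the unpaired file.
            unpaired.append((name, fwd or rev))
    return r1, r2, unpaired


# Hypothetical spots: one intact pair and two orphans.
spots = [
    ("s1", "ACGT", "TTGA"),
    ("s2", "GGCA", ""),
    ("s3", "", "CCAT"),
]
r1, r2, unpaired = split_three(spots)
print(len(r1), len(r2), len(unpaired))  # 1 1 2
```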

@jolespin
Contributor Author

From NCBI Help Desk:

Checking. I'd have to pull the originals to check, but my preliminary guess is that this arises because of asymmetry in the pairs: R2 might have fewer pairs (perhaps eliminated in aggressive QC?), accounting for a lopsided size difference between R1 and R2, where the "single ended file" is mostly R1.

I see this "three file" behavior with fasterq-dump, but not with a generic fastq-dump

fastq-dump --split-files --origfmt SRR13615821
Rejected 5860790 READS because READLEN < 1
Read 6669018 spots for SRR13615821
Written 6669018 spots for SRR13615821

wc -l SRR13615821*
26676072 SRR13615821_1.fastq
3232912 SRR13615821_2.fastq

grep "^@" SRR13615821_1.fastq | wc -l
6669018 #where 6,669,018 rounds up to the 6.7M reported by SRA pages.
grep "^@" SRR13615821_2.fastq | wc -l
808228

SRA Curator

Not sure if this is helpful or not for you. It's the first time I've experienced an issue like this.
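For what it's worth, the numbers from the two tools reconcile. The sketch below uses only figures quoted in this thread, with made-up variable names, and assumes 4 lines per FASTQ record (no wrapped sequences).

```python
# Figures quoted from the fastq-dump run in this thread.
spots_total = 6_669_018      # "Read 6669018 spots"
rejected_empty = 5_860_790   # "Rejected 5860790 READS because READLEN < 1"
r1_lines = 26_676_072        # wc -l SRR13615821_1.fastq
r2_lines = 3_232_912         # wc -l SRR13615821_2.fastq

# Figures from the seqkit stats table for the kingfisher download.
kingfisher_paired = 808_228      # num_seqs in each of _1 and _2
kingfisher_unpaired = 5_860_790  # num_seqs in SRR13615821.fastq.gz

# Assumption: 4 lines per FASTQ record.
r1_reads = r1_lines // 4
r2_reads = r2_lines // 4

print("fastq-dump R1 reads:", r1_reads)
print("fastq-dump R2 reads:", r2_reads)
print("spots minus rejected equals R2 reads:",
      spots_total - rejected_empty == r2_reads)
print("paired plus unpaired equals total spots:",
      kingfisher_paired + kingfisher_unpaired == spots_total)
```

Both checks hold, so the 5,860,790 reads in the unpaired file correspond to the spots whose second read is empty, and the 808,228 pairs are the spots where both mates are present.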
