error when opening valid .fastq.bz2 (Ran out of data in the middle of a fastq entry. Your file is probably truncated) #48
Comments
I'll take a look at the file, but I'm stuck with very limited internet for a while. Do you know whether the file could originally have been created by concatenating several existing bz2 files? I know we had problems with the core decompressors for gzip when that had happened, since it technically broke the spec, but the GNU tools were able to cope with it. The test would be this: if you decompress the file you have to a raw fastq and then re-compress that to a bz2 file, can the new file be read correctly?
Thank you Simon! It does indeed work with the re-compressed bz2. I don't know if the original file was created by concatenation (is there any way to find that out for sure from the file itself?). Unfortunately I received those files from another source, so I can't change how they are created.
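On the question of telling from the file itself whether it was concatenated: one possible check is to count how many independent bzip2 streams the file contains. The following is a minimal sketch using Python's standard bz2 module rather than anything from FastQC; the function name count_bz2_streams and its chunk_size parameter are made up for illustration. A count above 1 means several streams sit back to back, which is consistent with concatenation but does not prove how the file was produced.

```python
import bz2


def count_bz2_streams(fh, chunk_size=1 << 20):
    """Count the bzip2 streams stored back to back in a binary file object.

    Hypothetical helper for illustration; it is not part of FastQC.
    """
    streams = 0
    decomp = bz2.BZ2Decompressor()  # each instance decodes exactly one stream
    mid_stream = False              # True once the current decoder has been fed data
    pending = b""                   # bytes left over after an end-of-stream marker
    while True:
        chunk = pending or fh.read(chunk_size)
        pending = b""
        if not chunk:
            break
        decomp.decompress(chunk)    # decompressed output is discarded, we only count
        mid_stream = True
        if decomp.eof:
            streams += 1
            pending = decomp.unused_data     # start of the next stream, if any
            decomp = bz2.BZ2Decompressor()   # fresh decoder for the next stream
            mid_stream = False
    if mid_stream:
        raise ValueError("file ends in the middle of a bzip2 stream")
    return streams
```

Usage would look like `with open("08asp.fastq.bz2", "rb") as fh: print(count_bz2_streams(fh))`.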
I did some testing and it looks like the issue is caused by having multiple headers in the middle of the file.
Concatenated bzip2 files still decompress correctly with the Unix bzip2 tools, but the Java bzip2 library we're using closes the stream when it hits the first end marker, which is why progress jumps from 45% complete to 100%. I'm not sure why yours would crash, as I'd expect incomplete processing to be the more likely result. I know we had to work round a similar limitation in gzip decompression, which is why we have our own class for gzip decompression. I might be able to use the same strategy to work around the bzip2 decompressor's limitations, or it's possible that there is an updated version which can deal with this.
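The stop-at-the-first-end-marker behaviour described above can be reproduced outside Java. This is a rough sketch in Python, not the FastQC code: the standard library's BZ2Decompressor decodes a single stream, so it stands in for the single-stream library, and the loop in decompress_all_streams (a hypothetical name) shows the general "restart on leftover bytes" workaround. The sample data deliberately places the stream boundary inside a record; whether that is what triggers the reported error is an assumption, not something confirmed in this thread.

```python
import bz2

# Two fastq fragments; the boundary falls inside the second record (assumed layout).
part1 = b"@read1\nACGT\n+\nIIII\n@read2\nGG"
part2 = b"CC\n+\nIIII\n"
concatenated = bz2.compress(part1) + bz2.compress(part2)

# A single-stream decoder stops at the first end-of-stream marker.
single = bz2.BZ2Decompressor()
first_only = single.decompress(concatenated)  # only the contents of the first stream
leftover = single.unused_data                 # the second stream, still compressed


def decompress_all_streams(data):
    """Decode every bzip2 stream in `data` by restarting on the leftover bytes."""
    pieces = []
    while data:
        decoder = bz2.BZ2Decompressor()
        pieces.append(decoder.decompress(data))
        if not decoder.eof:
            raise ValueError("data ends in the middle of a bzip2 stream")
        data = decoder.unused_data
    return b"".join(pieces)


everything = decompress_all_streams(concatenated)
```

Here first_only holds only the first fragment and ends partway through the second record, while everything holds both fragments joined. Python's one-shot bz2.decompress also handles multiple streams, so it gives the same result as the loop.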
I had a look and it seems the Jbzip2 library we're currently using doesn't support this and appears to be unmaintained. However, the Apache Commons Compress library will work, and its documentation says:
So if we can switch to that, it looks like we can work round this problem. We might also be able to get rid of the kludge we have for gzip streams.
We ran into this exact issue with files we downloaded from a collaborator. The original .bz2 fails with the same error, but we have no problem with the decompressed fastq, or with a file recompressed as .bz2 or .gz.
Our short-term solution is to recompress everything into .gz files.
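For anyone wanting to script the same short-term workaround, here is a minimal sketch in Python. The function name recompress_bz2_to_gz and the file names in the usage line are placeholders, not something from this thread. It relies on bz2.open reading through every stream of a concatenated file, and it streams the data so the whole fastq never has to sit in memory.

```python
import bz2
import gzip
import shutil


def recompress_bz2_to_gz(src_path, dst_path):
    """Stream a (possibly multi-stream) .bz2 file into a single-stream .gz file.

    Hypothetical helper for illustration; it is not part of FastQC.
    """
    with bz2.open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # copies in blocks, decompressing as it goes
```

For example: `recompress_bz2_to_gz("sample.fastq.bz2", "sample.fastq.gz")`.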
Is there a patch available for this (i.e. "process only the first file of a concatenated bzip2")? In my case I can see in the log that |
Trying to run FastQC on my bz2-compressed file (fastqc 08asp.fastq.bz2), I get the error quoted in the title. I checked the integrity of the archive with bzip2 -t 08asp.fastq.bz2; more importantly, if I first decompress the very same file (bzip2 -d 08asp.fastq.bz2) and then run fastqc 08asp.fastq, it works without any issues.
Sample file can be temporarily downloaded here (181 Mb).