BAMHeader not set when running on a cluster #676
IIRC @fnothaft and I could not convince ourselves that the VCF header impl worked when we looked at it about a year ago; he claimed it was working for him (on a cluster), but I couldn't get it to work.
I just ran into this issue as well, btw.
#353 has some discussion of the VCF version of this issue.
I'll try my hand at porting a similar fix today and see if I have any more luck.
Just tried with master...ryan-williams:sam-attempt; still see the issue. Will dig through logs and see if I can verify whether the executors attempted to set the header.
heh, seems like the errors there were from double- and triple-setting the header on each executor (once per partition).
Forcibly clearing and then re-setting the header on each partition seems to do the trick.
I'll open a PR promptly.
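The clear-then-set pattern described above can be sketched in plain Python (no Spark, no Hadoop-BAM; `HeaderHolder` and `process_partition` are hypothetical stand-ins for the static per-JVM header state that the real output format keeps):

```python
# Minimal sketch of the double-set failure and the idempotent fix.
# Names here are illustrative, not ADAM's or Hadoop-BAM's actual API.

class HeaderHolder:
    """Stands in for static per-executor (per-JVM) header state."""
    _header = None

    @classmethod
    def set(cls, header):
        if cls._header is not None:
            # Naively setting once per partition fails on the second
            # partition that lands on the same executor.
            raise RuntimeError("header already set")
        cls._header = header

    @classmethod
    def clear(cls):
        cls._header = None

    @classmethod
    def get(cls):
        if cls._header is None:
            raise RuntimeError("BAM header not set")
        return cls._header

def process_partition(records, header):
    # The fix: forcibly clear, then set, on *every* partition, so the
    # operation is idempotent however many partitions share one JVM.
    HeaderHolder.clear()
    HeaderHolder.set(header)
    return [(HeaderHolder.get(), r) for r in records]

# Two partitions handled by the same "executor" (same process) both succeed:
out1 = process_partition(["read1", "read2"], header="@HD VN:1.5")
out2 = process_partition(["read3"], header="@HD VN:1.5")
```

Without the `clear()` call, the second `process_partition` call would raise, which matches the per-partition double-setting errors described above.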
Fixes bigdatagenomics#676.
- formatting / unnecessary `.toString` nits
- one more header fix
- whitespace nits
Interestingly, I just saw a failure when writing a SAM file, so there might be some nondeterministic way in which this remains unfixed. Here are the YARN container logs of all my executors. This gist has the logs of the two executors where relevant things happened: IDs 161 and 69. The only log lines about headers come from these two executors, and they were running on the same host (out of 94 in the application); probably a coincidence? Timeline:
Something for us to keep an eye on / think about possible explanations for!
I'm not 100% sure (obviously) but my money is on Hadoop-BAM being borked.
I thought I was seeing this again but decided the behavior is different enough that I'm going to file a new issue.
Bad news: I ran into this again today. Generally, tasks get placed to optimize locality, which is likely to be stable in some cases, but today I ran into trouble because the no-op header-setting job and the actual write job ran on different sets of executors.

Following are more details about my app, partitions, and executors that exposed this. I was attempting to load and save a BAM file. At the time of the no-op header-setting job, one set of executors was active; then the write job scheduled some of its tasks on executors that had not run a no-op task. The latter 4 executors failed immediately due to not having the BAM header set, and then retried the failed tasks in such rapid succession that they hit my max-task-failure limit.

I haven't fully digested the implications of this or thought much about ways to fix it, but I'm going to reopen this as I think it's real and still live!
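The failure mode above can be illustrated with a toy simulation in plain Python (no Spark; `Executor`, `run_header_job`, and `run_write_task` are hypothetical names). A one-time job sets per-executor state only on executors alive at that moment, so executors that join later, e.g. via dynamic allocation, never receive the header and fail on their first write task:

```python
# Toy simulation: per-executor state set by a one-time job is missing on
# executors that join the cluster afterwards.

class Executor:
    def __init__(self, exec_id):
        self.exec_id = exec_id
        self.header = None          # stands in for static per-JVM state

    def run_header_job(self, header):
        self.header = header

    def run_write_task(self):
        if self.header is None:
            raise RuntimeError(f"executor {self.exec_id}: BAM header not set")
        return f"executor {self.exec_id}: wrote shard"

cluster = [Executor(i) for i in range(3)]
for e in cluster:                    # the no-op header-setting job runs here
    e.run_header_job("@HD VN:1.5")

cluster.append(Executor(3))          # a new executor joins *after* that job

results, failures = [], []
for e in cluster:
    try:
        results.append(e.run_write_task())
    except RuntimeError as err:
        failures.append(str(err))
```

Only the three original executors succeed; the late-joining one fails exactly the way the tasks described above did, and no amount of retrying on that executor can help, since nothing ever sets its header.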
I think we have a 100% concrete, guaranteed fix in the single file path. Specifically, see 7d4b409. It's an ugly fix: we write the header to HDFS as a file, and then read it when creating a record reader. It should be straightforward to port over to the "normal" (write a BAM as many partitions) code path; I just haven't done it.
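A rough sketch of that sidecar-header approach, in plain Python with a temporary directory standing in for HDFS (all names here are illustrative, not ADAM's actual API): the driver writes the header to a file next to the output, and each reader/writer re-reads it when it is constructed, so no task depends on in-memory state left behind by an earlier job.

```python
# Sketch: persist the header to a sidecar file; every shard writer reads it
# on construction instead of relying on static per-executor state.
import os
import tempfile

def write_header_file(dir_path, header_text):
    path = os.path.join(dir_path, "_header.sam")
    with open(path, "w") as f:
        f.write(header_text)
    return path

class ShardWriter:
    def __init__(self, dir_path):
        # Read the header from the sidecar file, not from JVM statics.
        with open(os.path.join(dir_path, "_header.sam")) as f:
            self.header = f.read()

    def write_shard(self, records):
        return self.header + "\n".join(records)

with tempfile.TemporaryDirectory() as d:
    write_header_file(d, "@HD\tVN:1.5\n")
    # Any independently constructed writer sees the same header, regardless
    # of which executor it runs on or when that executor was launched.
    shard = ShardWriter(d).write_shard(["read1", "read2"])
```

The ugliness the comment mentions is the extra filesystem round-trip per reader/writer, but it trades that for correctness under any task placement.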
Resolved by #964. |
This isn't yet resolved for sharded files, I'll have a PR resolving this for sharded files this AM. |
Resolves bigdatagenomics#676. In bigdatagenomics#964, we resolved the "header not set" issues for single file SAM/BAM output. This change propagates that fix to sharded SAM/BAM output, and to VCF.
When running `transform` with a `.bam` file as the output, I see the following error:

I tried to fix this with a similar strategy to the one used by @fnothaft for the VCF header (see https://github.com/bigdatagenomics/adam/compare/master...hammerlab:bam-save-header?expand=1), but it did not seem to take. Any ideas on this?