[ADAM-384] Adds import from FASTQ. #385

Merged (1 commit) on Oct 1, 2014

fnothaft
Member

Resolves #384. Adds:

  • Load from "single ended" FASTQ
  • Load from interleaved FASTQ
  • Load from non-interleaved paired end (two file) FASTQ

Loading from single-ended FASTQ and interleaved FASTQ is handled seamlessly by the ADAMContext.adamLoad method. Since paired-end (but non-interleaved) data requires two file paths, I haven't added it to adamLoad; thus, it sits on its own.

I wrote our own FASTQ input format instead of using the one in Hadoop-BAM; theirs is only compatible with Hadoop 1, performs unnecessary parsing, and doesn't seem to pick splits correctly anyways. The SingleFastqInputFormat is almost a direct copy of the InterleavedFastqInputFormat, which we've tested pretty well on clusters.
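
To make the three loading modes listed above concrete, here is a minimal Python sketch of how each one maps onto FASTQ text. It is an illustration only, not the code in this PR: the names `FastqRecord`, `read_fastq`, `read_interleaved`, and `read_paired_files` are hypothetical, and it assumes every record is exactly four lines (header, sequence, `+`, quality).

```python
from itertools import islice, zip_longest
from typing import Iterable, Iterator, NamedTuple, Tuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def read_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Single-ended: yield one record per four input lines."""
    it = iter(lines)
    while True:
        chunk = [line.rstrip("\n") for line in islice(it, 4)]
        if not chunk:
            return
        # Assumption: single-line sequences, so line 1 is '@name' and line 3 is '+'.
        if len(chunk) != 4 or not chunk[0].startswith("@") or not chunk[2].startswith("+"):
            raise ValueError(f"malformed FASTQ record: {chunk!r}")
        yield FastqRecord(chunk[0][1:], chunk[1], chunk[3])


def read_interleaved(lines: Iterable[str]) -> Iterator[Tuple[FastqRecord, FastqRecord]]:
    """Interleaved: consecutive records form a (first, second) read pair."""
    records = read_fastq(lines)
    for first in records:
        second = next(records, None)
        if second is None:
            raise ValueError("interleaved FASTQ has an odd number of records")
        yield first, second


def read_paired_files(
    lines1: Iterable[str], lines2: Iterable[str]
) -> Iterator[Tuple[FastqRecord, FastqRecord]]:
    """Two-file paired end: record i of file 1 pairs with record i of file 2."""
    for first, second in zip_longest(read_fastq(lines1), read_fastq(lines2)):
        if first is None or second is None:
            raise ValueError("paired FASTQ files have different record counts")
        yield first, second
```

The two-file case takes two inputs where the other two take one, which is the same reason the description gives for keeping it out of a single-path load method.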

* quality string?
*
* For now I'm going to assume single-line sequences. This works for our sequencing
* application. We'll see if someone complains in other applications.
Contributor

The possibility of multi-line FASTQ sequences is a well-known bug in the format definition -- for precisely the reason you point out here. +1 on your call to "ignore, and revisit if someone complains."

Member Author

Agreed, bug/"feature". Alas, the curse of text based file formats.

Anyways, we'll need to fix that (sooner rather than later); I chose to not tackle it now, because Hadoop-BAM hadn't tackled it either.
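
For readers following the thread, the ambiguity is that `@` marks a record header but is also a valid quality character, so a reader that starts at an arbitrary point in the file cannot trust a leading `@` on its own. Under the single-line assumption, one common heuristic is to accept a line as a header only if the line two below it starts with `+`. The Python sketch below illustrates that idea on a list of lines; it is not the logic in this PR, the function name `first_record_start` is hypothetical, and a real input format would work on byte offsets rather than line indices.

```python
from typing import List, Optional


def first_record_start(lines: List[str], start: int = 0) -> Optional[int]:
    """Return the index of the first line at or after `start` that begins a record.

    Assumes single-line sequences, so a record is exactly four lines:
    '@name', sequence, '+', quality. A line starting with '@' is accepted
    as a header only if the line two below it starts with '+'.
    """
    # Stop early enough that a full four-line record still fits.
    for i in range(start, len(lines) - 3):
        if lines[i].startswith("@") and lines[i + 2].startswith("+"):
            return i
    return None
```

With four-line records, a quality line that starts with `@` is followed two lines later by a sequence line, which cannot start with `+`, so the check rejects it. Once sequences and quality strings may span several lines, that guarantee no longer holds, which is why the multi-line case needs separate handling.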

@tdanford
Contributor

tdanford commented Oct 1, 2014

See my comments inline; this PR also needs a rebase off the latest master, @fnothaft. Otherwise, it's looking pretty good to me!

@fnothaft
Member Author

fnothaft commented Oct 1, 2014

Changes made and code is rebased. Thanks for the review, @tdanford !

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/270/

massie added a commit that referenced this pull request Oct 1, 2014
[ADAM-384] Adds import from FASTQ.
@massie massie merged commit 99e6f2d into bigdatagenomics:master Oct 1, 2014
@massie
Member

massie commented Oct 1, 2014

Thanks, Frank!

Successfully merging this pull request may close these issues.

Import data from FASTQ
4 participants