Add biologist targeted section to the README #497

Closed
laserson opened this issue Nov 21, 2014 · 3 comments

@laserson
Contributor

Someone I talked with at Strata mentioned that biologists who become interested in ADAM and look it up land on the README and find a very computer science-y intro to ADAM. We should add a short section, accessible to biologists, about why ADAM is exciting.

@fnothaft fnothaft added this to the 0.21.0 milestone Jul 20, 2016
This was referenced Jul 20, 2016
@tverbeiren
Contributor

Draft suggestion kindly provided by Joke Reumers:

Over the last decade, DNA and RNA sequencing has evolved from an expensive, labor-intensive method to a cheap commodity. As a consequence, massive amounts of genomic and transcriptomic data are being generated. Typically, tools to process and interpret these data are developed at academic labs, with a focus on the excellence of the results generated, not on scalability and interoperability. A typical "sequencing pipeline" consists of a string of tools going from quality control, mapping, and mapped read preprocessing to variant calling or quantification, depending on the application at hand. Concretely, this usually means that the tools are glued together by scripts or workflow engines, with data written to files at each step.

This approach entails three main bottlenecks: 1) scaling the pipeline comes down to scaling each of the individual tools, 2) the stability of the pipeline heavily depends on the consistency of the intermediate file formats, and 3) writing to and reading from disk is a major slow-down.
We propose here a transformative solution for these problems: replacing ad hoc pipelines with the ADAM framework, developed in the Apache Spark ecosystem.

ADAM provides specialized file formats, built on Avro and Parquet, for the standard data structures used in genomics analysis: mapped reads (typically stored as .bam files), representations of genomic regions (.bed files), and variants (.vcf files). This makes it possible to use the in-memory cluster computing functionality of Apache Spark, ensuring efficient and fault-tolerant distribution based on data parallelism, without the intermediate disk operations required in classical distributed approaches.

Furthermore, the ADAM-Spark approach comes with an additional benefit. Typically, the endpoint of a sequencing pipeline is a file with processed data for a single sample, e.g. variants for DNA sequencing or read counts for RNA sequencing. However, the real endpoint of a sequencing experiment initiated by an investigator is the interpretation of these data in a certain context. This usually translates into (statistical) analysis of multiple samples, connection with (clinical) metadata, and interactive visualization, using data science tools such as R, Python, Tableau and Spotfire. In addition to scalable distributed processing, Spark also allows such interactive data analysis in the form of analysis notebooks (Spark Notebook or Zeppelin), or direct connection to the data in R and Python.

@fnothaft
Member

fnothaft commented Dec 9, 2016

I think this looks really good! @tverbeiren can you open a PR adding this to the README.md? Let me know if you're short on time and I can do it.

@heuermh
Member

heuermh commented Dec 12, 2016

Fixed by #1310

@heuermh heuermh closed this as completed Dec 12, 2016