Reverting README.md back to 17d1d4d.

bigdatagenomics · Nov 15, 2014 · 60ca9d4
1 parent 17cf66b
commit 60ca9d4
Showing 1 changed file with 259 additions and 6 deletions.
diff --git a/README.md b/README.md
@@ -1,13 +1,266 @@
-# ADAM
+ADAM
+====
+A genomics processing engine and specialized file format built using [Apache Avro](http://avro.apache.org), 
+[Apache Spark](http://spark.incubator.apache.org/) and [Parquet](http://parquet.io/). Apache 2 licensed.
 
-A genome analysis platform built on Apache Hadoop, Spark, Parquet and Avro. Apache 2 licensed.
+# Introduction
 
-[![Build Status](https://amplab.cs.berkeley.edu/jenkins/buildStatus/icon?job=ADAM)](https://amplab.cs.berkeley.edu/jenkins/job/ADAM/)
+Current genomic file formats are not designed for
+distributed processing. ADAM addresses this by explicitly defining data
+formats as [Apache Avro](http://avro.apache.org) objects and storing them in 
+[Parquet](http://parquet.io) files. [Apache Spark](http://spark.incubator.apache.org/)
+is used as the cluster execution system.
 
-To generate documentation,
+## Explicitly defined format
 
+The [Sequencing Alignment Map (SAM) and Binary Alignment Map (BAM)
+file specification](http://samtools.sourceforge.net/SAM1.pdf) defines a data format 
+for storing reads from aligners. The specification is well-written but provides
+no tools for developers to implement the format. Developers have to hand-craft
+source code to encode and decode the records, which is error-prone and an unnecessary
+hassle.
+
+In contrast, the
+[ADAM specification for storing reads](https://github.com/bigdatagenomics/bdg-formats/blob/master/src/main/resources/avro/bdg.avdl)
+is defined in the Avro Interface Description Language (IDL) which is directly converted
+into source code. Avro supports a number of programming languages. ADAM uses Java, but you could
+just as easily use this Avro IDL description as the basis for a Python project. Avro
+currently supports C, C++, C#, Java, JavaScript, PHP, Python and Ruby.
+
+## Ready for distributed processing
+
+The SAM/BAM format is record-oriented, with a single record for each read. However,
+the typical data access pattern is column-oriented, e.g. searching for bases at a
+specific position in a reference genome. The BAM specification tries to support
+this pattern by defining a format for a separate index file. However, this index
+needs to be regenerated any time your BAM file changes, which is costly. The index
+does reduce the cost of file seeks, but the columnar store that ADAM uses reduces
+it even further.
+
+Once you convert your BAM file to ADAM, it can be directly accessed by 
+[Hadoop Map-Reduce](http://hadoop.apache.org), [Spark](http://spark-project.org/), 
+[Shark](http://shark.cs.berkeley.edu), [Impala](https://github.com/cloudera/impala), 
+[Pig](http://pig.apache.org), [Hive](http://hive.apache.org), and similar systems. Using
+ADAM will unlock your genomic data and make it available to a broader range of
+systems.
+
+# Getting Started
+
+## Installation
+
+You will need to have [Maven](http://maven.apache.org/) installed in order to build ADAM.
+
+> **Note:** The default configuration is for Hadoop 2.2.0. If building against a different
+> version of Hadoop, please edit the build configuration in the `<properties>` section of
+> the `pom.xml` file.
+
+```
+$ git clone https://github.com/bigdatagenomics/adam.git
+$ cd adam
+$ export MAVEN_OPTS="-Xmx512m -XX:MaxPermSize=128m"
+$ mvn clean package -DskipTests
+...
+[INFO] ------------------------------------------------------------------------
+[INFO] BUILD SUCCESS
+[INFO] ------------------------------------------------------------------------
+[INFO] Total time: 9.647s
+[INFO] Finished at: Thu May 23 15:50:42 PDT 2013
+[INFO] Final Memory: 19M/81M
+[INFO] ------------------------------------------------------------------------
+```
+
+You might want to take a peek at the `scripts/jenkins-test` script and give it a run. It will fetch a mouse chromosome, encode it to ADAM
+reads and pileups, run flagstat, etc. We use this script to test that ADAM is working correctly.
+
+## Running ADAM
+
+ADAM is packaged via [appassembler](http://mojo.codehaus.org/appassembler/appassembler-maven-plugin/) and includes all necessary
+dependencies.
+
+You might want to add the following to your `.bashrc` to make running `adam` easier:
+
+```
+alias adam-local="bash ${ADAM_HOME}/adam-cli/target/appassembler/bin/adam"
+alias adam-submit="${ADAM_HOME}/bin/adam-submit"
+alias adam-shell="${ADAM_HOME}/bin/adam-shell"
 ```
-cd docs
-./build.sh
+
+`$ADAM_HOME` should be the path to where you have checked ADAM out on your local filesystem. 
+The first alias should be used for running ADAM jobs that operate locally. The latter two aliases 
+call scripts that wrap the `spark-submit` and `spark-shell` commands to set up ADAM. You'll need
+to have the Spark binaries on your system; prebuilt binaries can be downloaded from the
+[Spark website](http://spark.apache.org/downloads.html). Currently, we build for
+[Spark 1.1 and Hadoop 2.3 (CDH5)](http://d3kbcqa49mib13.cloudfront.net/spark-1.1.0-bin-hadoop2.3.tgz).
+
+Once these aliases are in place, you can run ADAM by typing `adam-local` at the command line, e.g.
+
 ```
+$ adam-local
+
+     e            888~-_              e                 e    e
+    d8b           888   \            d8b               d8b  d8b
+   /Y88b          888    |          /Y88b             d888bdY88b
+  /  Y88b         888    |         /  Y88b           / Y88Y Y888b
+ /____Y88b        888   /         /____Y88b         /   YY   Y888b
+/      Y88b       888_-~         /      Y88b       /          Y888b
+
+Choose one of the following commands:
+
+           transform : Convert SAM/BAM to ADAM format and optionally perform read pre-processing transformations
+            flagstat : Print statistics on reads in an ADAM file (similar to samtools flagstat)
+           reads2ref : Convert an ADAM read-oriented file to an ADAM reference-oriented file
+             mpileup : Output the samtool mpileup text from ADAM reference-oriented data
+               print : Print an ADAM formatted file
+   aggregate_pileups : Aggregate pileups in an ADAM reference-oriented file
+            listdict : Print the contents of an ADAM sequence dictionary
+             compare : Compare two ADAM files based on read name
+    compute_variants : Compute variant data from genotypes
+            bam2adam : Single-node BAM to ADAM converter (Note: the 'transform' command can take SAM or BAM as input)
+            adam2vcf : Convert an ADAM variant to the VCF ADAM format
+            vcf2adam : Convert a VCF file to the corresponding ADAM format
+
+```
+
+ADAM outputs all the commands that are available for you to run. To get
+help for a specific command, run `adam-local <command>` without any additional arguments.
+
+````
+$ adam-submit transform
+Argument "INPUT" is required
+ INPUT                                                           : The ADAM, BAM or SAM file to apply the transforms to
+ OUTPUT                                                          : Location to write the transformed data in ADAM/Parquet format
+ -coalesce N                                                     : Set the number of partitions written to the ADAM output directory
+ -dump_observations VAL                                          : Local path to dump BQSR observations to. Outputs CSV format.
+ -h (-help, --help, -?)                                          : Print help
+ -known_indels VAL                                               : VCF file including locations of known INDELs. If none is provided, default
+                                                                   consensus model will be used.
+ -known_snps VAL                                                 : Sites-only VCF giving location of known SNPs
+ -log_odds_threshold N                                           : The log-odds threshold for accepting a realignment. Default value is 5.0.
+ -mark_duplicate_reads                                           : Mark duplicate reads
+ -max_consensus_number N                                         : The maximum number of consensus to try realigning a target region to. Default
+                                                                   value is 30.
+ -max_indel_size N                                               : The maximum length of an INDEL to realign to. Default value is 500.
+ -max_target_size N                                              : The maximum length of a target region to attempt realigning. Default length is
+                                                                   3000.
+ -parquet_block_size N                                           : Parquet block size (default = 128mb)
+ -parquet_compression_codec [UNCOMPRESSED | SNAPPY | GZIP | LZO] : Parquet compression codec
+ -parquet_disable_dictionary                                     : Disable dictionary encoding
+ -parquet_logging_level VAL                                      : Parquet logging level (default = severe)
+ -parquet_page_size N                                            : Parquet page size (default = 1mb)
+ -print_metrics                                                  : Print metrics to the log on completion
+ -qualityBasedTrim                                               : Trims reads based on quality scores of prefix/suffixes across read group.
+ -qualityThreshold N                                             : Phred scaled quality threshold used for trimming. If omitted, Phred 20 is used.
+ -realign_indels                                                 : Locally realign indels present in reads.
+ -recalibrate_base_qualities                                     : Recalibrate the base quality scores (ILLUMINA only)
+ -repartition N                                                  : Set the number of partitions to map data to
+ -sort_fastq_output                                              : Sets whether to sort the FASTQ output, if saving as FASTQ. False by default.
+                                                                   Ignored if not saving as FASTQ.
+ -sort_reads                                                     : Sort the reads by referenceId and read position
+ -trimBeforeBQSR                                                 : Performs quality based trim before running BQSR. Default is to run quality based
+                                                                   trim after BQSR.
+ -trimFromEnd N                                                  : Trim to be applied to end of read.
+ -trimFromStart N                                                : Trim to be applied to start of read.
+ -trimReadGroup VAL                                              : Read group to be trimmed. If omitted, all reads are trimmed.
+ -trimReads                                                      : Apply a fixed trim to the prefix and suffix of all reads/reads in a specific read
+                                                                   group.
+
+````
+
+If you followed along above, now try making your first `.adam` file like this:
+
+````
+adam-submit transform $ADAM_HOME/adam-core/src/test/resources/small.sam /tmp/small.adam
+````
+
+If you didn't obtain your copy of ADAM from GitHub, you can [grab `small.sam` from here](https://raw.githubusercontent.com/bigdatagenomics/adam/master/adam-core/src/test/resources/small.sam).
+
+
+# flagstat
+
+Once you have data converted to ADAM, you can gather statistics from the ADAM file using `flagstat`.
+This command outputs statistics similar to those of the samtools `flagstat` command.
+
+If you followed along above, now try gathering some statistics:
+
+````
+$ adam-local flagstat /tmp/small.adam
+20 + 0 in total (QC-passed reads + QC-failed reads)
+0 + 0 primary duplicates
+0 + 0 primary duplicates - both read and mate mapped
+0 + 0 primary duplicates - only read mapped
+0 + 0 primary duplicates - cross chromosome
+0 + 0 secondary duplicates
+0 + 0 secondary duplicates - both read and mate mapped
+0 + 0 secondary duplicates - only read mapped
+0 + 0 secondary duplicates - cross chromosome
+20 + 0 mapped (100.00%:0.00%)
+0 + 0 paired in sequencing
+0 + 0 read1
+0 + 0 read2
+0 + 0 properly paired (0.00%:0.00%)
+0 + 0 with itself and mate mapped
+0 + 0 singletons (0.00%:0.00%)
+0 + 0 with mate mapped to a different chr
+0 + 0 with mate mapped to a different chr (mapQ>=5)
+````
+
+In practice, you'll find that the ADAM `flagstat` command takes substantially less
+time than samtools to compute these statistics. For example, on a MacBook Pro,
+`flagstat NA12878_chr20.bam` took 17 seconds to run while `samtools flagstat NA12878_chr20.bam`
+took 55 seconds. On larger files, the difference in speed is even more dramatic. ADAM is faster
+because it's multi-threaded and distributed and uses a columnar storage format (with a
+projected schema that only materializes the read flags instead of the whole read). 
+
+# count_kmers
+
+You can also use ADAM to count all K-mers present across all reads in the
+`.adam` file using `count_kmers`.  Try this:
+
+````
+$ adam-local count_kmers /tmp/small.adam /tmp/kmers.adam 10
+$ head /tmp/kmers.adam/part-*
+TTTTAAGGTT, 1
+TTCCGATTTT, 1
+GAGCAGCCTT, 1
+CCTGCTGTAT, 1
+AATTGGCACT, 1
+GGCCAGGACT, 1
+GCAGTCCCTC, 1
+AACTTTGAAT, 1
+GATGACGTGG, 1
+CTGTCCCTGT, 1
+````
+
+The output directory contains part-* file(s) with line-based records, each holding two
+comma-delimited values. The first value is the K-mer itself and the second
+value is the number of times that K-mer occurred in the input file.
+
+# Running on a cluster
+
+We provide the `adam-submit` and `adam-shell` commands under the `bin` directory. These can
+be used to submit ADAM jobs to a Spark cluster, or to run ADAM interactively.
+
+## Running Plugins
+
+ADAM allows users to create plugins via the [ADAMPlugin](https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/org/bdgenomics/adam/plugins/ADAMPlugin.scala)
+trait. These plugins are then imported using the Java classpath at runtime. To add to the classpath when
+using appassembler, use the `$CLASSPATH_PREFIX` environment variable. For an example of how to use
+the plugin interface, please see the [adam-plugins repo](https://github.com/heuermh/adam-plugins).
+
+# Getting In Touch
+
+## Mailing List
+
+[The ADAM mailing list](https://groups.google.com/forum/#!forum/adam-developers) is a good
+way to sync up with other people who use ADAM including the core developers. You can subscribe
+by sending an email to `adam-developers+subscribe@googlegroups.com` or just post using
+the [web forum page](https://groups.google.com/forum/#!forum/adam-developers).
+
+## IRC Channel
+
+A lot of the developers hang out on the [#adamdev](http://webchat.freenode.net/?channels=adamdev)
+freenode.net channel. Come join us and ask questions.
+
+# License
 
+ADAM is released under an [Apache 2.0 license](LICENSE.txt).
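
As a rough illustration of what the `count_kmers` output in the README diff above represents, the following Python sketch counts K-mers over a few read sequences held in memory. It is not ADAM's implementation, which runs on Spark over Parquet data. The function name `count_kmers` and the sample reads are assumptions made up for this example.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring (K-mer) across all reads."""
    counts = Counter()
    for read in reads:
        # Slide a window of width k along the read; reads shorter
        # than k yield an empty range and contribute nothing.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical reads, not taken from small.sam.
    reads = ["ACGTACGT", "CGTA"]
    for kmer, n in sorted(count_kmers(reads, 4).items()):
        # Same "KMER, COUNT" layout as the part-* records in the README.
        print(f"{kmer}, {n}")
```

Each printed line pairs a K-mer with the number of times it occurred across all reads, which is the same two-value record layout the README describes for the `part-*` files.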