how to compare with the last the column for the same chromosome name? #1237

Fei-Guang · 2016-11-03T07:39:07Z

$ samtools view rdd1.bam |head -1

 NB501244AR:119:HJY3WBGXY:3:11508:7857:8792      16      chr1    10002   0       40M     *       0       0 AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACC        ///E////6E////EEAEEE/EEEEEEEEEEEEAEAAA/A        XT:A:R  NM:i:0  X0:i:594        XM:i:0  XO:i:0  XG:i:0  MD:Z:40

$ cat rdd2.kmer | head -10
chr1 1 0.42235 0.01501 10001 110000
chr1 2 0.41104 0.01254 60001 160000
chr1 3 0.42826 0.00282 110001 177417 227418 260000
chr1 4 0.4151 0.00288 160001 177417 227418 267719 317720 360000
chr1 5 0.39534 0.00166 260001 267719 317720 410000
chr1 6 0.42745 0.0008 360001 460000
chr1 7 0.47991 0.0041 410001 471368 521369 560000
chr1 8 0.44 0.00977 460001 471368 521369 610000
chr1 9 0.39048 0.00875 560001 660000
chr1 10 0.42579 0.00624 610001 710000

what i want to do is to compare the 4th column of rdd1.bam with the 5th column and last column of rdd2.kmer if both rdd has the same chromosome name(The third column of rdd1 should be same as the first column of rdd2)

if the 4th column of rdd1.bam in range [the 5th column of rdd2.kmer, the last column of rdd2.kmer]
the count[chr1] ++

how can i do it with adam0.20?

The text was updated successfully, but these errors were encountered:

fnothaft · 2016-11-03T16:42:21Z

Hi @Fei-Guang! How do you know which line of the second file (rdd2.kmer) corresponds to which read? If the first read in rdd1.bam should line up with the first line of rdd2.kmer, I'd do:

import org.bdgenomics.adam.rdd.ADAMContext._
val reads = sc.loadAlignments("rdd1.bam")
val lines = sc.textFile("rdd2.kmer")

// conceptually, we could just do reads.rdd.zip(lines), but!
// we aren't guaranteed that both RDDs have the same number
// of records in each partition, so zipWithIndex followed by join
// is (slower, but) safer
val zippedLinesAndReads = reads.rdd
  .zipWithIndex
  .map(_.swap)
  .join(lines.zipWithIndex.map(_.swap))

val countsByChromosome = zippedLinesAndReads.flatMap(kv => {
  val (_, (read, line)) = kv

  // get the range from the rdd2.kmer file
  val columns = line.split("\t[") // i assume this is tab delimited?
  val start = columns(4).toLong
  val end = columns(5).toLong

  // is the alignment start position between the start and end pos from the line?
  // if yes, emit the chromosome name and 1
  if (start <= read.getStart && read.getStart < end) {
    Some((read.getContigName, 1))
  } else {
    None
  }
}).reduceByKeyLocally(_ + _)

fnothaft · 2016-11-03T16:42:59Z

What's the format of the rdd2.kmer file? If it is a typical feature format (e.g., BED, GTF), this code would be slightly cleaner.

Fei-Guang · 2016-11-04T01:33:30Z

hello, @fnothaft

Both rdd should have the same chromosome name, it means
The third column of rdd1 should be same as the first column of rdd2.

take the first line of rdd1 as an example, the chromosome name is chr1.

Many thanks for the help

Fei-Guang · 2016-11-04T02:04:37Z

hello @fnothaft

the format of the rdd2.kmer is defined by our engineer, not in BED and GTF format

fnothaft · 2016-11-04T03:16:09Z

Hi @Fei-Guang! Where are you running that? Are you running that in spark-shell? ADAM relies on a custom Kryo serializer Registrator for serialization. If you use ./bin/adam-shell, this starts a Spark shell where the serialization config (and classpath) are set up.

Fei-Guang · 2016-11-04T04:15:24Z

hello @fnothaft

what should be changed in the code if the line of the second file (rdd2.kmer) corresponds the line of the rdd1.bam by the chromosome name?

fnothaft · 2016-11-04T16:06:28Z

@Fei-Guang I think you'd want:

import org.bdgenomics.adam.rdd.ADAMContext._
val reads = sc.loadAlignments("rdd1.bam")
val lines = sc.textFile("rdd2.kmer")
  .map(line => {
  // get the range from the rdd2.kmer file
  val columns = line.split("\t[") // i assume this is tab delimited?
  val contig = columns(0)
  val start = columns(4).toLong
  val end = columns(5).toLong

  (contig, (start, end))
})

// join against the other RDD keyed by contig name
val zippedLinesAndReads = reads.rdd.flatMap(read => {
  // check whether the read is mapped, lest we get a null pointer exception
  if (read.getReadMapped) {
    Some((read.getContigName, read.getStart))
  } else {
    None
  }
}).join(lines)

val countsByChromosome = zippedLinesAndReads.flatMap(kv => {
  val (contigName, (readStart, (start, end))) = kv

  // is the alignment start position between the start and end pos from the line?
  // if yes, emit the chromosome name and 1
  if (start <= readStart && readStart < end) {
    Some((contigName, 1))
  } else {
    None
  }
}).reduceByKeyLocally(_ + _)

Fei-Guang changed the title ~~how to compare with the last the column?~~ how to compare with the last the column for the same chromosome name? Nov 4, 2016

Fei-Guang closed this as completed Nov 7, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

how to compare with the last the column for the same chromosome name? #1237

how to compare with the last the column for the same chromosome name? #1237

Fei-Guang commented Nov 3, 2016 •

edited

Loading

fnothaft commented Nov 3, 2016

fnothaft commented Nov 3, 2016

Fei-Guang commented Nov 4, 2016 •

edited

Loading

Fei-Guang commented Nov 4, 2016

fnothaft commented Nov 4, 2016

Fei-Guang commented Nov 4, 2016 •

edited

Loading

fnothaft commented Nov 4, 2016

how to compare with the last the column for the same chromosome name? #1237

how to compare with the last the column for the same chromosome name? #1237

Comments

Fei-Guang commented Nov 3, 2016 • edited Loading

fnothaft commented Nov 3, 2016

fnothaft commented Nov 3, 2016

Fei-Guang commented Nov 4, 2016 • edited Loading

Fei-Guang commented Nov 4, 2016

fnothaft commented Nov 4, 2016

Fei-Guang commented Nov 4, 2016 • edited Loading

fnothaft commented Nov 4, 2016

Fei-Guang commented Nov 3, 2016 •

edited

Loading

Fei-Guang commented Nov 4, 2016 •

edited

Loading

Fei-Guang commented Nov 4, 2016 •

edited

Loading