Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

how to compare with the last the column for the same chromosome name? #1237

Closed
Fei-Guang opened this issue Nov 3, 2016 · 7 comments
Closed

Comments

@Fei-Guang
Copy link

Fei-Guang commented Nov 3, 2016

$ samtools view rdd1.bam |head -1

 NB501244AR:119:HJY3WBGXY:3:11508:7857:8792      16      chr1    10002   0       40M     *       0       0 AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACC        ///E////6E////EEAEEE/EEEEEEEEEEEEAEAAA/A        XT:A:R  NM:i:0  X0:i:594        XM:i:0  XO:i:0  XG:i:0  MD:Z:40 

$ cat rdd2.kmer | head -10
chr1 1 0.42235 0.01501 10001 110000
chr1 2 0.41104 0.01254 60001 160000
chr1 3 0.42826 0.00282 110001 177417 227418 260000
chr1 4 0.4151 0.00288 160001 177417 227418 267719 317720 360000
chr1 5 0.39534 0.00166 260001 267719 317720 410000
chr1 6 0.42745 0.0008 360001 460000
chr1 7 0.47991 0.0041 410001 471368 521369 560000
chr1 8 0.44 0.00977 460001 471368 521369 610000
chr1 9 0.39048 0.00875 560001 660000
chr1 10 0.42579 0.00624 610001 710000

what i want to do is to compare the 4th column of rdd1.bam with the 5th column and last column of rdd2.kmer if both rdd has the same chromosome name(The third column of rdd1 should be same as the first column of rdd2)

if the 4th column of rdd1.bam in range [the 5th column of rdd2.kmer, the last column of rdd2.kmer]
the count[chr1] ++

how can i do it with adam0.20?

@fnothaft
Copy link
Member

fnothaft commented Nov 3, 2016

Hi @Fei-Guang! How do you know which line of the second file (rdd2.kmer) corresponds to which read? If the first read in rdd1.bam should line up with the first line of rdd2.kmer, I'd do:

import org.bdgenomics.adam.rdd.ADAMContext._
val reads = sc.loadAlignments("rdd1.bam")
val lines = sc.textFile("rdd2.kmer")

// conceptually, we could just do reads.rdd.zip(lines), but!
// we aren't guaranteed that both RDDs have the same number
// of records in each partition, so zipWithIndex followed by join
// is (slower, but) safer
val zippedLinesAndReads = reads.rdd
  .zipWithIndex
  .map(_.swap)
  .join(lines.zipWithIndex.map(_.swap))

val countsByChromosome = zippedLinesAndReads.flatMap(kv => {
  val (_, (read, line)) = kv

  // get the range from the rdd2.kmer file
  val columns = line.split("\t[") // i assume this is tab delimited?
  val start = columns(4).toLong
  val end = columns(5).toLong

  // is the alignment start position between the start and end pos from the line?
  // if yes, emit the chromosome name and 1
  if (start <= read.getStart && read.getStart < end) {
    Some((read.getContigName, 1))
  } else {
    None
  }
}).reduceByKeyLocally(_ + _)

@fnothaft
Copy link
Member

fnothaft commented Nov 3, 2016

What's the format of the rdd2.kmer file? If it is a typical feature format (e.g., BED, GTF), this code would be slightly cleaner.

@Fei-Guang
Copy link
Author

Fei-Guang commented Nov 4, 2016

hello, @fnothaft

Both rdd should have the same chromosome name, it means
The third column of rdd1 should be same as the first column of rdd2.

take the first line of rdd1 as an example, the chromosome name is chr1.

Many thanks for the help

@Fei-Guang
Copy link
Author

hello @fnothaft

the format of the rdd2.kmer is defined by our engineer, not in BED and GTF format

@Fei-Guang Fei-Guang changed the title how to compare with the last the column? how to compare with the last the column for the same chromosome name? Nov 4, 2016
@fnothaft
Copy link
Member

fnothaft commented Nov 4, 2016

Hi @Fei-Guang! Where are you running that? Are you running that in spark-shell? ADAM relies on a custom Kryo serializer Registrator for serialization. If you use ./bin/adam-shell, this starts a Spark shell where the serialization config (and classpath) are set up.

@Fei-Guang
Copy link
Author

Fei-Guang commented Nov 4, 2016

hello @fnothaft

what should be changed in the code if the line of the second file (rdd2.kmer) corresponds the line of the rdd1.bam by the chromosome name?

@fnothaft
Copy link
Member

fnothaft commented Nov 4, 2016

@Fei-Guang I think you'd want:

import org.bdgenomics.adam.rdd.ADAMContext._
val reads = sc.loadAlignments("rdd1.bam")
val lines = sc.textFile("rdd2.kmer")
  .map(line => {
  // get the range from the rdd2.kmer file
  val columns = line.split("\t[") // i assume this is tab delimited?
  val contig = columns(0)
  val start = columns(4).toLong
  val end = columns(5).toLong

  (contig, (start, end))
})

// join against the other RDD keyed by contig name
val zippedLinesAndReads = reads.rdd.flatMap(read => {
  // check whether the read is mapped, lest we get a null pointer exception
  if (read.getReadMapped) {
    Some((read.getContigName, read.getStart))
  } else {
    None
  }
}).join(lines)

val countsByChromosome = zippedLinesAndReads.flatMap(kv => {
  val (contigName, (readStart, (start, end))) = kv

  // is the alignment start position between the start and end pos from the line?
  // if yes, emit the chromosome name and 1
  if (start <= readStart && readStart < end) {
    Some((contigName, 1))
  } else {
    None
  }
}).reduceByKeyLocally(_ + _)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants