Adding several simple normalizations. #311

Merged
merged 1 commit into from
Jul 29, 2014

fnothaft
Member

Added two normalizations:

  • Normalization via Z Score
  • Target length normalization (e.g., RPKM)

@tdanford, @carlyeks, and I discussed the best place for this and thought it made the most sense to have it inside ADAM, since downstream tools like RNAdam will depend on it and the normalizations are useful primitives across many omics algorithms. I'd also welcome comments on whether org.bdgenomics.adam.rdd.normalization is the best package for this.
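
For readers unfamiliar with the two normalizations named above, here is a minimal sketch of the arithmetic on plain Scala collections rather than Spark RDDs. It is not the code from this PR: the object name `NormalizationSketch`, both method names, the use of the population standard deviation, and the convention that the total read count is the sum of the supplied counts are all assumptions made for illustration.

```scala
object NormalizationSketch {

  // Z-score: subtract the mean, then divide by the standard deviation.
  // Assumption: population standard deviation (divide by n, not n - 1).
  def zScore[T](data: Seq[(Double, T)]): Seq[(Double, T)] = {
    require(data.nonEmpty, "cannot normalize an empty collection")
    val n = data.size.toDouble
    val mean = data.map(_._1).sum / n
    val variance = data.map { case (v, _) => (v - mean) * (v - mean) }.sum / n
    val sd = math.sqrt(variance)
    require(sd > 0.0, "all values are identical, so the z-score is undefined")
    // The attached payload of type T is carried through unchanged.
    data.map { case (v, t) => ((v - mean) / sd, t) }
  }

  // RPKM: reads per kilobase of target per million reads,
  // i.e. count * 1e9 / (targetLength * totalReads).
  // Assumption: totalReads is the sum of the counts passed in.
  def rpkm[T](data: Seq[(Double, Long, T)]): Seq[(Double, Long, T)] = {
    require(data.forall(_._2 > 0L), "target lengths must be positive")
    val total = data.map(_._1).sum
    require(total > 0.0, "total read count must be positive")
    data.map { case (count, length, t) =>
      (count * 1e9 / (length.toDouble * total), length, t)
    }
  }
}
```

For example, with counts of 10 and 30 over targets of 1000 and 2000 bases, `rpkm` yields 250000.0 and 375000.0 under these assumptions.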

*
* @see pkn
*/
def apply[I <: Interval, T](rdd: RDD[(Double, I, T)]): RDD[(Double, I, T)] = {
Contributor

Just out of curiosity, why the (Double, I, T) ordering here? I would have expected most of these RDDs to have a form keyed off the interval (i.e. (I, Double)) and if you wanted to carry along extra information, maybe a key-value where the value was a pair (i.e. (I, (Double, T))).
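
To make the two tuple shapes in this question concrete, the sketch below converts between the flat `(Double, I, T)` form and the interval-keyed `(I, (Double, T))` form. It uses plain Scala collections, and `KeyedShapeSketch`, `Region`, `toKeyed`, and `toFlat` are hypothetical names, not part of ADAM; `Region` only stands in for whatever `Interval` type the PR uses.

```scala
object KeyedShapeSketch {

  // Hypothetical stand-in for an interval type; not ADAM's Interval.
  final case class Region(name: String, start: Long, end: Long)

  // Flat form (value, interval, payload) -> keyed form (interval, (value, payload)).
  def toKeyed[I, T](flat: Seq[(Double, I, T)]): Seq[(I, (Double, T))] =
    flat.map { case (v, i, t) => (i, (v, t)) }

  // Inverse conversion, back to the flat form.
  def toFlat[I, T](keyed: Seq[(I, (Double, T))]): Seq[(Double, I, T)] =
    keyed.map { case (i, (v, t)) => (v, i, t) }
}
```

One practical argument for the keyed shape is that a Spark RDD of two-element tuples picks up the key-based operations in `PairRDDFunctions` (for example `reduceByKey` and `join`), whereas a three-element tuple has to be reshaped first.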

@tdanford
Contributor

I'm still reviewing this @fnothaft, sorry for the delay...

@carlyeks
Member

Ping @tdanford.

@tdanford
Contributor

Ping @fnothaft -- see my question, above, about the (Double, I, T) type signature.

@tdanford
Contributor

Ping @fnothaft :-)

@fnothaft
Member Author

@tdanford just addressed your comment, and rebased the change on master.

@carlyeks
Member

Jenkins, retest this please.

* @tparam T Type of data passed along.
*/
def apply[T](rdd: RDD[(Double, T)]): RDD[(Double, T)] = {
val cachedRdd = rdd.cache
Contributor

Out of curiosity, Frank, why (here, and above as well) the explicit call to cache and unpersist? Given that these RDDs appear to only be used once...?

Member Author

@tdanford cachedRdd is used in 1 count and 3 separate map calls.

Contributor

No description provided.

Member Author

Indeed; all your base are belong to me now.

@tdanford
Contributor

Jenkins, retest this please.

@tdanford
Contributor

I'm going to merge this anyway, since it builds in my hands -- and the tests that are failing aren't the ones you've added here.

tdanford added a commit that referenced this pull request Jul 29, 2014
Adding several simple normalizations.
@tdanford tdanford merged commit 70d81f8 into bigdatagenomics:master Jul 29, 2014
@tdanford
Contributor

Thanks, @fnothaft!

3 participants