This repository has been archived by the owner on Sep 8, 2018. It is now read-only.

Don't encourage use of RPKM #1

Closed
blahah opened this issue Feb 26, 2014 · 2 comments

Comments

@blahah

blahah commented Feb 26, 2014

This tool looks excellent, but by using RPKM it will be mathematically wrong and less effective than it potentially could be. You should be operating on library-size normalised effective counts, or on TPM.

RPKMs are not comparable between samples. This follows directly from the definition of RPKM, because it depends on the mean expressed transcript length. This was first pointed out explicitly in the RSEM paper (section 1.1.1). It was then restated (without attribution!) more explicitly in Wagner et al. It has also been demonstrated empirically to make a difference.
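
To make the argument concrete, here is a minimal Python sketch of both measures. It is an illustration only, not code from SwitchSeq or from the papers mentioned above; the function names and the toy counts and lengths are made up for the example. It uses raw lengths rather than effective lengths, which is a simplification.

```python
def rpkm(counts, lengths):
    """Reads per kilobase per million mapped reads, for one sample."""
    total_reads = sum(counts)
    # count / (length in kb) / (total reads in millions)
    return [c * 1e9 / (length * total_reads) for c, length in zip(counts, lengths)]


def tpm(counts, lengths):
    """Transcripts per million, for one sample."""
    # Length-normalise first, then rescale so the sample sums to one million.
    rates = [c / length for c, length in zip(counts, lengths)]
    total_rate = sum(rates)
    return [r / total_rate * 1e6 for r in rates]


# Hypothetical toy data: three transcripts, two samples.
lengths = [1000, 2000, 4000]
sample_a = [100, 200, 400]
sample_b = [100, 200, 1200]  # only the longest transcript changes

for name, counts in (("A", sample_a), ("B", sample_b)):
    print(name, "RPKM total:", sum(rpkm(counts, lengths)))
    print(name, "TPM total: ", sum(tpm(counts, lengths)))
```

The TPM total is one million in every sample by construction, so a TPM value always means the same proportion of the transcript pool. The RPKM total changes with the length-weighted composition of each sample, which is why the same RPKM value can correspond to different relative abundances in different samples.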

For further context, see Lior Pachter's keynote, in which he apologises for making Cufflinks use RPKM, which has led to it being widely misused: http://www.youtube.com/watch?v=5NiFibnbE8o.

Any reviewer who has been paying attention to the literature will pick this up. Admittedly, there aren't many such reviewers, but it's also important to maximise the correctness and utility of your software.

@mgonzalezporta
Owner

Thanks a lot for your comments.

It is important to note that SwitchSeq does not attempt to make any claims about significance based on RPKM/FPKM values, and that the user is pointed to alternative tools like DEXSeq and MMDIFF for that. Given the initial matrix of expression values, SwitchSeq identifies the most abundant transcript within each gene for each given sample, and reports cases where the identity of this transcript differs across conditions. Hence, the use of RPKMs/FPKMs is limited to the visualisation of expression levels, and those should be interpreted in the context of the results provided by the tools mentioned above.
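
The logic described above can be sketched roughly as follows. This is a minimal Python illustration, not SwitchSeq's actual code: the nested-dictionary input, the function names, and the rule for combining replicates (a gene is reported when the sets of dominant transcripts differ between conditions) are all assumptions made for the example.

```python
from collections import defaultdict


def dominant_transcripts(expr):
    """Map each gene to {sample: most abundant transcript in that sample}.

    expr is assumed to look like {gene: {transcript: {sample: value}}}.
    """
    result = {}
    for gene, transcripts in expr.items():
        samples = set()
        for values in transcripts.values():
            samples.update(values)
        result[gene] = {
            s: max(transcripts, key=lambda t: transcripts[t].get(s, 0.0))
            for s in samples
        }
    return result


def switch_candidates(expr, condition_of):
    """Report genes whose dominant transcript differs between conditions."""
    candidates = {}
    for gene, per_sample in dominant_transcripts(expr).items():
        by_condition = defaultdict(set)
        for sample, transcript in per_sample.items():
            by_condition[condition_of[sample]].add(transcript)
        # Assumed rule: flag the gene if conditions disagree on the dominant set.
        if len({frozenset(v) for v in by_condition.values()}) > 1:
            candidates[gene] = {c: sorted(t) for c, t in by_condition.items()}
    return candidates


# Hypothetical expression values for two genes in four samples.
expr = {
    "geneA": {
        "tx1": {"s1": 10, "s2": 12, "s3": 2, "s4": 1},
        "tx2": {"s1": 3, "s2": 2, "s3": 9, "s4": 11},
    },
    "geneB": {
        "tx3": {"s1": 8, "s2": 9, "s3": 7, "s4": 8},
        "tx4": {"s1": 1, "s2": 1, "s3": 2, "s4": 1},
    },
}
condition_of = {"s1": "control", "s2": "control", "s3": "treated", "s4": "treated"}
print(switch_candidates(expr, condition_of))
```

Because only the rank of transcripts within a gene and within a single sample is used, a per-sample scaling factor cannot change which transcript comes out on top. That is consistent with the point that the choice of normalisation matters mainly for visualisation here.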

Furthermore, both SwitchSeq and tviz can be used with any normalisation method, as long as the input is provided in the required matrix format. Typically, I encourage people to work with the normalisation method provided within the DESeq2 package, and then divide the normalised counts by the feature length if the goal is to compare across genes. This approach is fully compatible with SwitchSeq.
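
As a rough illustration of that approach, here is a Python sketch of median-of-ratios size factors (the idea behind DESeq2's size factor estimation) followed by division by feature length. It is a simplified re-implementation for illustration, not a call into DESeq2, and the function names and toy data are hypothetical.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts is assumed to look like {feature: [count in sample 0, sample 1, ...]}.
    """
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts.values():
        if any(c <= 0 for c in row):
            continue  # a zero count gives no finite log geometric mean
        log_geo_mean = sum(math.log(c) for c in row) / n_samples
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    # Raises statistics.StatisticsError if no feature is non-zero in all samples.
    return [math.exp(median(r)) for r in log_ratios]


def length_scaled(counts, lengths):
    """Divide counts by the sample size factor, then by feature length."""
    factors = size_factors(counts)
    return {
        feature: [c / s / lengths[feature] for c, s in zip(row, factors)]
        for feature, row in counts.items()
    }


# Hypothetical data: sample 1 was sequenced exactly twice as deeply as sample 0.
counts = {"t1": [10, 20], "t2": [100, 200], "t3": [50, 100]}
lengths = {"t1": 1000, "t2": 2500, "t3": 500}
print(size_factors(counts))
print(length_scaled(counts, lengths))
```

In this toy case the two samples differ only in depth, so the scaled values come out identical across samples for every feature.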

I've now updated the documentation for both SwitchSeq and tviz to not refer exclusively to the RPKM method, and I've clarified this in the tutorial. In addition, I have changed the name of the slot 'rpkm' from the TranscriptExpressionSet object to 'texp'.

I hope this addresses your concerns.

@blahah
Author

blahah commented Feb 26, 2014

Thanks for the rapid response, and for your clarification of the actual use of the values within SwitchSeq. It's now clear that SwitchSeq wouldn't have been making incorrect calls on that basis. The updated tutorial encourages good practice, as does the slot name change.

I agree with your suggested normalisation strategy in general.

Concerns addressed in full!

@blahah blahah closed this as completed Feb 26, 2014