
Project idea: Create package with master classes for population genetic data #8

Open
zkamvar opened this issue Mar 8, 2015 · 11 comments


@zkamvar

zkamvar commented Mar 8, 2015

As mentioned in #4, we have a lot of different classes for handling population genetic data and they are all useful in their own right. One thing to think about is that while the representation of the actual genetic data might differ (frequencies vs. values vs. bitwise), there are basic forms of metadata that are common to all of them: population assignment, individual assignment, etc.

Thinking along the same lines as modular synthesizers, I propose to create a package that defines a class or set of classes and methods that will formally define metadata that is often needed for population genetic analysis. This would allow for easier construction of conversion functions between future classes and result in a more consistent workflow between packages.
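As a rough illustration of the modular idea, here is a minimal sketch in base R (the `methods` package only). All class and function names (`ToyPopMeta`, `ToyAlleleCounts`, `ToyAlleleFreqs`, `popSizes`) are hypothetical and are not taken from any existing package. The sketch assumes the shared metadata is just individual names and population assignment.

```r
library(methods)

# Hypothetical virtual class: holds only the metadata shared by all
# representations. It cannot be instantiated on its own.
setClass("ToyPopMeta",
         representation("VIRTUAL", ind = "character", pop = "factor"))

# Two hypothetical concrete classes that store the genetic data differently
# but inherit the same metadata slots.
setClass("ToyAlleleCounts",
         representation(counts = "matrix"), contains = "ToyPopMeta")
setClass("ToyAlleleFreqs",
         representation(freqs = "matrix"), contains = "ToyPopMeta")

# A method written once against the metadata class works for both.
setGeneric("popSizes", function(x) standardGeneric("popSizes"))
setMethod("popSizes", "ToyPopMeta", function(x) table(x@pop))

ids  <- c("i1", "i2", "i3")
pops <- factor(c("A", "A", "B"))

a <- new("ToyAlleleCounts", ind = ids, pop = pops,
         counts = matrix(c(2, 1, 0, 0, 1, 2), nrow = 3))
b <- new("ToyAlleleFreqs", ind = ids, pop = pops,
         freqs = matrix(c(1, 0.5, 0, 0, 0.5, 1), nrow = 3))

popSizes(a)  # two individuals in A, one in B
popSizes(b)  # same result, different data representation
```

A conversion function between two such classes would only need to translate the data slot, since the metadata slots could be copied across unchanged.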

I realize that this might fall under the problem of proliferation of standards, but I believe that if we design these to be modular, it should not be an issue.

@smhoban
Member

smhoban commented Mar 9, 2015

Are you thinking of something like a conversion engine, along the lines of PGDSpider (http://www.cmpg.unibe.ch/software/PGDSpider/#Introduction) or Create (https://bcrc.bio.umass.edu/pedigreesoftware/node/2), but of course within R? I think that would be great and I think others would be interested too. Allan Strand and I have talked a bit about this before.

@zkamvar
Author

zkamvar commented Mar 9, 2015

I was thinking more along the lines of the data representation within R as opposed to flat file formats (that's not to say different file format handlers in R are not needed). Generally along the lines of what Bioconductor emphasizes (http://www.bioconductor.org/developers/package-guidelines/) for future package development:

> Re-use existing S4 classes and generics where possible.

By creating a core set of classes that can be built upon, future developers can ensure interoperability within R. For example, I have created a class that contains the genind object from adegenet (https://github.com/grunwaldlab/poppr/blob/master/R/classes.r#L95-L99). These are still valid genind objects, so all of the methods associated with genind objects are also associated with the genclone objects, and I didn't have to re-invent the wheel by writing new methods to compute things like expected heterozygosity.
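The pattern can be sketched without adegenet installed. In the example below, `ToyGenotypes` is a hypothetical stand-in for an existing class such as genind, `ToyClones` is a hypothetical subclass playing the role of genclone, and `countIndividuals` is an invented generic. None of these names come from adegenet or poppr, and the real classes have more slots than shown here.

```r
library(methods)

# Hypothetical stand-in for an existing, widely used class.
setClass("ToyGenotypes",
         representation(tab = "matrix", pop = "factor"))

# A method defined for the existing class.
setGeneric("countIndividuals",
           function(x) standardGeneric("countIndividuals"))
setMethod("countIndividuals", "ToyGenotypes", function(x) nrow(x@tab))

# Hypothetical subclass: everything from ToyGenotypes plus one new slot.
setClass("ToyClones",
         representation(mlg = "integer"), contains = "ToyGenotypes")

x <- new("ToyClones",
         tab = matrix(c(1, 0, 0, 1, 1, 0), nrow = 3),
         pop = factor(c("A", "A", "B")),
         mlg = c(1L, 1L, 2L))

is(x, "ToyGenotypes")  # TRUE: the subclass object is still a valid parent object
countIndividuals(x)    # 3: the parent's method is inherited, nothing re-written
```

Only behaviour that depends on the new slot needs a new method. Everything else dispatches to the parent class.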

Additionally, Bioconductor has a long presentation discussing the use of S4 classes and methods (http://www.bioconductor.org/help/course-materials/2010/AdvancedR/S4InBioconductor.pdf).

@thibautjombart
Contributor

I think it is indeed essential to reuse existing classes wherever possible. S4 is nice because inheritance is possible, and it makes classes easier to change too.

Zhian, in your example, I am not sure adding a level of hierarchy to @pop justifies what is effectively a new class. If this problem is general enough (feedback from the community will be useful here), the simplest course of action would be adding a new slot to the genind class.

Some classes have been around and used for a while, meaning they roughly do the job. I think we can build upon them, update what is necessary and favour interoperability. Metadata can be anything and everything, and will to some extent depend on the type of data: DNA sequences, allele frequencies, and phylogenetic trees will each have their peculiarities.


@zkamvar
Author

zkamvar commented Mar 10, 2015

My example actually does what you suggest and adds new slots onto the genind class (the @hierarchy slot is used to set up the data and feed it into the @pop slot). Admittedly, my initial suggestion is a bit of a lofty goal and, pragmatically, building off of the existing classes would be the thing to do (besides, adegenet already contains the modular virtual classes: gen, popinfo, and indinfo).

Perhaps an alternative would be to construct a short tutorial for future developers that outlines the following:

  • a list of the different classes and the type of data they are good for
  • why data classes are necessary and useful
  • examples of utilizing inheritance to add new functionality
    • in S4 classes
    • in S3 classes
    • from S3 to S4 classes

The goal for either direction is to encourage future developers to contribute while maintaining interoperability between the packages and lowering the activation energy needed to do so.
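To give a flavour of what the S3 and S3-to-S4 parts of such a tutorial might show, here is a minimal sketch. The names `toy_freq`, `toy_freq_pop`, `describe` and `meanFreq` are all hypothetical and do not belong to any existing package. It assumes an S3 class that is just a list with a class attribute. `setOldClass()` is the standard `methods` function for making S3 classes visible to S4 dispatch.

```r
library(methods)

# --- S3: a hypothetical class holding allele frequencies ---
toy_freq <- function(freq) structure(list(freq = freq), class = "toy_freq")

describe <- function(x, ...) UseMethod("describe")
describe.toy_freq <- function(x, ...) {
  paste0("toy_freq with ", length(x$freq), " alleles")
}

# --- S3 inheritance: prepend a subclass name to the class vector ---
toy_freq_pop <- function(freq, pop) {
  obj <- toy_freq(freq)
  obj$pop <- pop
  class(obj) <- c("toy_freq_pop", class(obj))
  obj
}

# The subclass method reuses the parent method via NextMethod().
describe.toy_freq_pop <- function(x, ...) {
  paste0(NextMethod(), " in population ", x$pop)
}

# --- S3 to S4: register the S3 classes so S4 generics can dispatch on them ---
setOldClass(c("toy_freq_pop", "toy_freq"))
setGeneric("meanFreq", function(x) standardGeneric("meanFreq"))
setMethod("meanFreq", "toy_freq", function(x) mean(x$freq))

p <- toy_freq_pop(c(0.25, 0.75), pop = "A")
describe(p)  # "toy_freq with 2 alleles in population A"
meanFreq(p)  # 0.5, via the S4 method registered for the S3 parent class
```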

Thoughts?

@warnes

warnes commented Mar 10, 2015

Hi Everyone,

I feel it is absolutely essential to provide a 'reference' R object class to store raw genetics data and annotations, along with appropriate tools to import/transform/export between this and common data formats.

In 2004, the other members of the R-Genetics project (https://sourceforge.net/projects/r-genetics/) and I developed the GeneticsBase package for Bioconductor for this purpose. The basic code is now somewhat dated, but should serve as a good foundation for a modern update/reimplementation.

One of my desires for the Hackathon is to revive, update, and extend GeneticsBase and the other R-Genetics project packages, building appropriate tools to integrate with the current crop of genetics tools, both R and stand-alone.

The source code for all of the R-Genetics packages is available in the SourceForge CVS repository at http://r-genetics.cvs.sourceforge.net/viewvc/r-genetics/.

-Greg


@thibautjombart
Contributor

Hi there,

Most classes are documented already in their respective packages, but a document providing an outline of the different classes, their structures and accessors would surely be useful. I think this was Emmanuel's idea as well. While useful for everyone, I think the emphasis should be more on the users than on the developers though. Usually, contributors / package developers seem to be OK figuring out class contents.


@hlapp
Member

hlapp commented Mar 10, 2015

> The source code for all of the R-Genetics packages is available in the SourceForge CVS repository at http://r-genetics.cvs.sourceforge.net/viewvc/r-genetics/.

Would it be worth converting that to Git?

@emmanuelparadis

Hi,
All these discussions make me think that we already have a lot of good stuff for population genetics in R. So yes, I agree that a user-friendly "synthesis" of the available information would be great.

@grunwald

I agree with all the great posts. A synthesis of the available tools in a primer/wiki and a move to GitHub would be great.

@peterdfields

+1!

@warnes

warnes commented Mar 11, 2015

Yes, absolutely.

I'm very time constrained this week because of a family health issue, so it would be a great help if someone could assist with doing this.

Actually, the most recent code for these packages is probably in the Bioconductor SVN tree.



8 participants