Project idea: Create package with master classes for population genetic data #8

zkamvar · 2015-03-08T18:57:12Z

As mentioned in #4, we have a lot of different classes for handling population genetic data and they are all useful in their own right. One thing to think about is that while the representation of the actual genetic data might differ (frequencies vs. values vs. bitwise), some basic forms of metadata are common to all of them: population assignment, individual assignment, etc.

Thinking along the same lines as modular synthesizers, I propose to create a package that defines a class or set of classes and methods that will formally define metadata that is often needed for population genetic analysis. This would allow for easier construction of conversion functions between future classes and result in a more consistent workflow between packages.

I realize that this might fall under the problem of proliferation of standards, but I believe that if we design these to be modular, it should not be an issue.

smhoban · 2015-03-09T19:09:47Z

Are you thinking of something like a conversion engine, along the lines of PGDSpider (http://www.cmpg.unibe.ch/software/PGDSpider/#Introduction) or Create (https://bcrc.bio.umass.edu/pedigreesoftware/node/2), but of course within R? I think that would be great and I think others would be interested too. Allan Strand and I have talked a bit about this before.

zkamvar · 2015-03-09T20:23:45Z

I was thinking more along the lines of the data representation within R as opposed to flat file format (That's not to say different file format handlers in R are not needed). Generally along the lines of what Bioconductor emphasizes for future package development:

Re-use existing S4 classes and generics where possible.

By creating a core set of classes that can be built upon, future developers can ensure interoperability within R. For example, I have created a class that contains the genind object from adegenet. These are still valid genind objects, so all of the methods associated with genind objects are also associated with the genclone objects and I didn't have to re-invent the wheel in terms of creating new methods to compute things like expected heterozygosity.
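The inheritance pattern described above can be sketched in a few lines of base R. This is only an illustration: `toygenind`, `toygenclone`, the `mlg` slot and the `nInd` generic are hypothetical stand-ins, not the actual adegenet or poppr definitions.

```r
library(methods)

# Hypothetical stand-in for an existing parent class (not the real genind definition)
setClass("toygenind", representation(tab = "matrix", pop = "factor"))

# The child class adds one slot and inherits the rest through `contains`
setClass("toygenclone", contains = "toygenind", representation(mlg = "integer"))

# A generic whose only method is written for the parent class
setGeneric("nInd", function(x) standardGeneric("nInd"))
setMethod("nInd", "toygenind", function(x) nrow(x@tab))

parent <- new("toygenind",
              tab = matrix(0, nrow = 3, ncol = 2),
              pop = factor(c("a", "a", "b")))

# Build the child from an existing parent object plus the new slot
child <- new("toygenclone", parent, mlg = c(1L, 1L, 2L))

is(child, "toygenind")  # the child is still a valid parent object
nInd(child)             # the parent's method is used for the child
```

No method had to be rewritten for the child class; anything defined for the parent keeps working.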

Additionally, Bioconductor has a long presentation discussing the use of S4 classes and methods.

thibautjombart · 2015-03-10T10:28:42Z

I think it is indeed essential to reuse existing classes wherever possible.
S4 is nice as inheritance is possible, and makes classes easier to change
too.
Zhian, in your example, I am not sure adding a level of hierarchy to @pop
justifies what is effectively a new class. If this problem is general
enough (feedback from the community will be useful here), the simplest
course of action would be adding a new slot to the genind class.

Some classes have been around and used for a while, meaning they roughly do
the job. I think we can build upon them, update what is necessary and
favour interoperability. Metadata can be anything and everything, and will
to some extent depend on the type of data - DNA sequences, allele
frequencies, phylogenetic trees will have their peculiarities.

On Mon, Mar 9, 2015 at 8:23 PM, Zhian N. Kamvar notifications@github.com
wrote:

I was thinking more along the lines of the data representation within R as
opposed to flat file format (That's not to say different file format
handlers in R are not needed). Generally along the lines of what Bioconductor
emphasizes (http://www.bioconductor.org/developers/package-guidelines/)
for future package development:

Re-use existing S4 classes and generics where possible.

By creating a core set of classes that can be built upon, future
developers can ensure interoperability within R. For example, I have
created a class that contains the genind object from adegenet
(https://github.com/grunwaldlab/poppr/blob/master/R/classes.r#L95-L99).
These are still valid genind objects, so all of the methods associated with
genind objects are also associated with the genclone objects and I didn't
have to re-invent the wheel in terms of creating new methods to compute
things like expected heterozygosity.

Additionally, Bioconductor has a long presentation discussing the use of
S4 classes and methods
(http://www.bioconductor.org/help/course-materials/2010/AdvancedR/S4InBioconductor.pdf).


zkamvar · 2015-03-10T17:11:33Z

My example actually does what you suggest and adds new slots onto the genind class (the @hierarchy slot is used to set up the data and feed it into the @pop slot). Admittedly, my initial suggestion is a bit of a lofty goal and, pragmatically, building off of the existing classes would be the thing to do (besides, adegenet already contains the modular virtual classes: gen, popinfo, and indinfo).

Perhaps an alternative would be to construct a short tutorial for future developers that outlines the following:

- a list of the different classes and the type of data they are good for
- why data classes are necessary and useful
- examples of utilizing inheritance to add new functionality
  - in S4 classes
  - in S3 classes
  - from S3 to S4 classes
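As a rough idea of what the S3 and S3-to-S4 examples in such a tutorial might look like, here is a minimal sketch in base R. All class and function names (`toy_loci`, `toy_strata`, `n_loci`, `lociNames`) are hypothetical, and the "S3 to S4" part takes one possible reading of that bullet: registering an S3 class with `setOldClass()` so that S4 generics can dispatch on it.

```r
library(methods)

# S3 inheritance: the class attribute is an ordered vector, child first.
# All names below are hypothetical, not taken from any package.
new_toy_loci <- function(loci) structure(list(loci = loci), class = "toy_loci")
new_toy_strata <- function(loci, strata) {
  obj <- new_toy_loci(loci)
  obj$strata <- strata
  class(obj) <- c("toy_strata", class(obj))
  obj
}

# An S3 generic with a method for the parent class only
n_loci <- function(x, ...) UseMethod("n_loci")
n_loci.toy_loci <- function(x, ...) length(x$loci)

child <- new_toy_strata(c("L1", "L2"), strata = c("north", "south"))
n_loci(child)     # falls through to the toy_loci method

# S3 to S4: register the S3 hierarchy so S4 generics can dispatch on it
setOldClass(c("toy_strata", "toy_loci"))
setGeneric("lociNames", function(x) standardGeneric("lociNames"))
setMethod("lociNames", "toy_loci", function(x) x$loci)
lociNames(child)  # S4 dispatch on an S3 object, via the registered parent
```

In both halves the child object reuses a method written for its parent, which is the point the tutorial would be making.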

The goal for either direction is to encourage future developers to contribute while maintaining interoperability between the packages and lowering the activation energy needed to do so.

Thoughts?

warnes · 2015-03-10T17:51:53Z

Hi Everyone,

I feel it is absolutely essential to provide a 'reference' R object class to
store raw genetics data and annotations along with appropriate tools to
import/transform/export between this and common data formats.

In 2004, I and the other members of the R-Genetics project
(https://sourceforge.net/projects/r-genetics/) developed the GeneticsBase
package for Bioconductor for this purpose. The basic code is now somewhat
dated, but should serve as a good foundation for a modern
update/reimplementation.

One of my desires for the Hackathon is to revive, update, and extend
GeneticsBase and the other R-Genetics project packages, building
appropriate tools to integrate with the current crop of genetics tools,
both R and stand-alone.

The source code for all of the R-Genetics packages is available in the
SourceForge CVS repository at
http://r-genetics.cvs.sourceforge.net/viewvc/r-genetics/.

-Greg



thibautjombart · 2015-03-10T18:46:34Z

Hi there,
Most classes are documented already in their respective packages, but a
document providing an outline of the different classes, their structures
and accessors would surely be useful. I think this was Emmanuel's idea as
well. While useful for everyone, I think the emphasis should be more on the
users than on the developers though. Usually, contributors / package
developers seem to be OK figuring out class contents.


hlapp · 2015-03-10T20:27:50Z

The source code for all of the R-Genetics packages is available in the
SourceForge CVS repository at
http://r-genetics.cvs.sourceforge.net/viewvc/r-genetics/.

Would it be worth converting that to Git?

emmanuelparadis · 2015-03-10T20:40:39Z

Hi,
All these discussions make me think that we already have a lot of good stuff for population genetics in R. So yes, I agree that a "synthesis" of the available information in a friendly way would be great.

grunwald · 2015-03-10T21:53:10Z

I agree with all the great posts. A synthesis of available tools in a primer/wiki and a move to GitHub would be great.

peterdfields · 2015-03-10T22:02:49Z

+1!

warnes · 2015-03-11T16:19:16Z

Yes, absolutely.

I'm very time constrained this week because of a family health issue, so it would be a great help if someone could assist with doing this.

Actually, the most recent code for these packages is probably in the Bioconductor SVN tree.



zkamvar added the project idea label Mar 8, 2015
