This repository has been archived by the owner on Aug 26, 2023. It is now read-only.

Migrating to modular software organization #425

Closed
TransGirlCodes opened this issue Mar 20, 2017 · 86 comments

Comments

@TransGirlCodes
Member

TransGirlCodes commented Mar 20, 2017

The community has requested that the organisation of software in BioJulia be made more modular than it currently is, for the benefit of both users and developers. We are reaching a point in Bio.jl where Travis does not complete, and anyone wanting one module of Bio.jl has to install and compile/pre-compile the rest of Bio.jl.

Below I migrate all conversations over from Gitter, and I outline a draft set of guidelines for maintainers and contributors about the organisation of packages.

@TransGirlCodes
Member Author

@mkborregaard

@Ward9250 I had another question I'd like to ask you - is Phylo bound to being a submodule of Bio? It seems to me that phylo does not depend on Bio functionality - so a possible model would be to have Phylo live outside but be imported by Bio. Here is the reason I ask - a lot of my ecology code depends on phylogenies, but uses no other Bio functionality. I don't care (I also use Bio things) but I want the SpatialEcology package to use phylogenies but not incur all of Bio as a dependency. I think this is probably not just me - phylogenies are used in much broader fields than DNA stuff.
I hope it isn't impolite to ask this question here

@TransGirlCodes
Member Author

@Ward9250

The same argument can be made for any of the other Bio submodules - for example if one only wanted to use sequences.
Once you start asking that about the different packages, the entire purpose of a Bio.jl package gets called into question, along with whether modular or monolithic design is better. You can make the case for either design pattern - the most successful piece of monolithic software being the Linux kernel.

@TransGirlCodes
Member Author

@ChrisRackauckas

@Ward9250 @mkborregaard FWIW, DifferentialEquations.jl went the same direction because of these kinds of requests. But it does have more interdependencies, hence DiffEqBase.jl as a common small dependency. And then DifferentialEquations.jl just reexports the functionality from the various packages.

@TransGirlCodes
Member Author

@Ward9250

The main argument for Phylo being in Bio, or indeed all of the modules being in Bio, is that we can more easily make sure everything in one module of Bio is compatible with another. One of the reasons development of Phylo has been slow is that I'm trying to design it at the same time as designing things to go in Var, because my plans for storing and efficiently working with terabytes' worth of trees also feed into my plans for working with both genetic variation and uncertainties in pan-genomic graphs.
Bioconductor has a similar design.
Creating individual packages from Bio.jl submodules wouldn't mean the death of Bio.jl, as it could still exist as a metapackage binding common tools together for installation convenience.

@TransGirlCodes
Member Author

@ChrisRackauckas

and common docs can be really nice.

@TransGirlCodes
Member Author

@Ward9250

The question would be how fine scale do we go? Does Bio.Seq become a nucleotides package, a K-mer package, a Biological Sequence package, and so on? We're still one org, so we could still enforce compatibility between the packages.
Indeed the common docs are great.

@TransGirlCodes
Member Author

@ChrisRackauckas Mar 18 17:49

I think you can go pretty fine scale
and make all functions work on abstract types
implement an interface
and have the concrete types just be standard implementations of the interface
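
To illustrate the idea of writing functions against an abstract type, here is a minimal sketch in current Julia syntax. Every name in it (`AbstractSeq`, `seqdata`, `gc_content`, `SimpleSeq`) is hypothetical and chosen only for illustration; none of them is taken from Bio.jl or any other BioJulia package.

```julia
# Hypothetical names throughout: AbstractSeq, seqdata, gc_content and
# SimpleSeq are illustrations, not part of any BioJulia package.

# The interface: every subtype is expected to implement `seqdata`.
abstract type AbstractSeq end

# Required method; this fallback raises an informative error.
seqdata(s::AbstractSeq) = error("seqdata not implemented for $(typeof(s))")

# Generic function written only against the interface.
function gc_content(s::AbstractSeq)
    data = seqdata(s)
    isempty(data) && return 0.0
    return count(c -> c in ('G', 'C'), data) / length(data)
end

# A standard concrete implementation of the interface.
struct SimpleSeq <: AbstractSeq
    data::String
end
seqdata(s::SimpleSeq) = s.data

gc_content(SimpleSeq("GATTACA"))
```

A different package could define its own subtype of `AbstractSeq`, implement `seqdata` for it, and get `gc_content` without depending on `SimpleSeq`.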

@TransGirlCodes
Member Author

@Ward9250 Mar 18 17:51

To a certain extent smaller packages would be more modular and easy to manage.

@TransGirlCodes
Member Author

@kescobo Mar 18 17:53

I'll just speak up here for the bioinformatics newbs - if getting the functionality I need means hunting for a bunch of packages, that's going to dramatically increase the barrier for use. Would it be possible to have a bunch of subpackages, but also allow installation of everything with one call to Pkg.add()?

@TransGirlCodes
Member Author

@ChrisRackauckas Mar 18 17:53

yes
that's what the proposal is

@TransGirlCodes
Member Author

@Ward9250 Mar 18 17:53

@ALL As one of the maintainers of BioJulia I am willing to give this a go - perhaps leave Bio.jl as it is for now so nobody gets shot in the foot with their PRs, but move some of the stuff out to individual packages and begin transitioning to the alternative organisation, if that is what the people want.

@TransGirlCodes
Member Author

@ChrisRackauckas Mar 18 17:53

For reference, the vast majority of DifferentialEquations.jl is implemented in other packages
https://github.com/JuliaDiffEq/DifferentialEquations.jl/blob/master/src/DifferentialEquations.jl
Reexport.jl handles all of it
most users never really know that it's all separately managed
but other developers, when they find out they can just depend on what they need, like it
It seems Bio.jl has a lot less interdependencies, so it might even be easier
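
As a rough sketch of the meta-package pattern described above, the following uses plain Julia modules as stand-ins for separately maintained packages. The module and function names (`SeqSketch`, `PhyloSketch`, `BioSketch`, `revcomp_str`, `ntips`) are hypothetical, and the re-export is written by hand so the example runs without installing anything; Reexport.jl's `@reexport using` macro is meant to automate this step.

```julia
# Stand-ins for separately maintained packages (hypothetical names).
module SeqSketch
export revcomp_str
# Reverse-complement a DNA string; assumes only A, C, G, T are present.
revcomp_str(s::AbstractString) =
    join(Dict('A' => 'T', 'T' => 'A', 'C' => 'G', 'G' => 'C')[c] for c in reverse(s))
end

module PhyloSketch
export ntips
# Count the tips of a tree given as nested tuples; any non-tuple is a tip.
ntips(tree) = tree isa Tuple ? sum(ntips, tree) : 1
end

# The meta-package: no functionality of its own, it only re-exports.
module BioSketch
using ..SeqSketch, ..PhyloSketch
export revcomp_str, ntips
end

# A user loads only the meta-package and sees both functions.
using .BioSketch
revcomp_str("GATTACA")
ntips((("a", "b"), "c"))
```

A developer who needs only trees could depend on the `PhyloSketch` stand-in directly and never load the sequence code.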

@TransGirlCodes
Member Author

@kescobo Mar 18 18:03

Well, it sounds like we can get the benefits of modularity without losing much of the benefit of the monolith
What's the main cost to this approach? Is it just on the maintainers to keep track of multiple sites where issues/PRs etc are logged?

@TransGirlCodes
Member Author

@ChrisRackauckas Mar 18 18:04

yes
to really maintain it all, you do need to watch all the repos (though the org setup can make that happen automatically)
if there's lots of interdependencies, then sometimes tagging can get a little difficult
i.e. X might need master of Y for tests to pass, which uses Z in tests, which uses X in tests.
It's solved by version limits though
but it sounds like, if Phylo and some things like that aren't as interconnected, they won't have those issues
one really good thing about the change is that the tests run separately though

@TransGirlCodes
Member Author

@Ward9250 Mar 18 19:32

So moving forward on the proposal, I'm moving some of the more core interfaces of BioJulia - Exceptions, IO, common functions and so on - into a module called BioCore.jl
None of this will affect Bio.jl or other packages yet, until Bio.jl has the same code removed, and comes to import BioCore as a dependency.

@TransGirlCodes
Member Author

@mkborregaard Mar 18 20:07

Hey friends, thanks a lot for taking such progressive action on my question. I think this is a really cool and positive development. I know the Plots organisation a little more, and it works similarly - e.g. PlotUtils implements everything that is generally useful for other plotting packages, and Plots just reexports it, so from the perspective of Plots it is all just the same package, but other packages depending on PlotUtils don't need the rest of Plots. And it does cause some Travis issues at times as @ChrisRackauckas says, if a change needs to be rolled out across repos at the same time (though this only happens when really basic functionality or logic is changed).
I really believe this will be experienced as a great development by contributors and dependents alike

@TransGirlCodes
Member Author

@Ward9250 Mar 19 02:08

Following on from that conversation - here's an example, taking nucleic acids from Bio.Seq and making them a package https://github.com/BioJulia/NucleicAcids.jl as a first "proof of concept", assuming the principle that if there's a new type introduced, it should be its own package.

@TransGirlCodes
Member Author

@jgreener64 Mar 19 16:39

I think this could be a nice idea - Bio.jl is getting a bit monolithic and splitting stuff into separate packages is quite appealing. Having a single Pkg.add() command to get it all is important because we should keep the barrier to entry as low as possible.
However I don't think we should go overboard and have 20 small packages just to get basic bio functionality - this would be confusing and a bit of a nightmare version/dependency situation.

@TransGirlCodes
Member Author

@bicycle1885 11:08

+1 for restructuring Bio.jl as a set of dedicated packages. However, I think it's too early to send pull requests for METADATA.jl because we haven't confirmed anything about the design of modular packages. We are not sure whether this style really works well in BioJulia and which functionalities should be moved to other packages. Let's open an issue on Bio.jl and discuss the plan a little bit further.

@bicycle1885
Member

Thank you, @Ward9250.

I think "Bio" is informative enough as the prefix of package names. I would rather not type such long names in many places. So, BiologicalSymbols.jl should be BioSymbols.jl in my opinion.

@TransGirlCodes
Member Author

TransGirlCodes commented Mar 20, 2017

Proposal for BioJulia Ecosystem Organisation Guidelines

Definitions

  1. A first type/method is considered tightly-linked to a second type/method if the first type/method is not useable independent of the second type/method.
    e.g. A BioSequence and a Phylogeny are not tightly linked, you can use either without the other.
    e.g. A Structure.Residue is not useable without a Structure.Atom, and so they are tightly linked.
    Note: This definition only considers the current state of the contribution/code, as you can spend forever speculating on the organisation of "would-be" code.

Organisation of existing Bio.jl code.

  1. The BioJulia Ecosystem will have a certain amount of shared infrastructure. For example, the consistent IO interface, functions that are shared between BioJulia packages (e.g. seqname), and exception types. This core set should go into a package - BioCore.jl.

  2. Each of the submodules of Bio.jl should be moved to its own dedicated package, and should load the shared infrastructure from BioCore.jl with import/using.

  3. Some of the Bio.jl submodules are larger than others; some introduce one or two data-types, and some introduce a lot more. In this case, the larger submodules should be split into several smaller packages based on the types. (See Definition 1.)

  4. Bio.jl will become an easy-install meta-package which re-exports the new more modular packages. We may have more task or field-of-study dedicated meta-packages.

Organisation of new contributions.

  1. If a new contribution is to add a datatype or method that is deemed to be tightly connected to an existing package in the BioJulia Ecosystem (See Definition 1.), then it can be added to the existing package.

  2. However if it is not tightly linked, then it can be contributed to BioJulia as a new package.

Once agreement on guidelines is finalised, I will upload them to our contribution policy site

@bicycle1885
Member

We also need to keep Pkg3 in mind, which is planned to supersede the current packaging system. But I have no idea about the effects of Pkg3.

@TransGirlCodes
Member Author

TransGirlCodes commented Mar 20, 2017

It looks like Pkg3 will be a help rather than a harm: registries, environments and so on all look like they will help us when it comes to dependencies and versions.

@jgreener64
Member

I think the proposal is on the right track.

Bio.Structure lends itself nicely to being a single package I would say.

@kescobo
Member

kescobo commented Mar 20, 2017

Looks good, thanks @Ward9250!

Just because it's the thing I've been working on, I have a clarification on your definition. The MinHashSketch type that is in #415 is independent of other types ATM, but generating them, at least the way I'd use them, depends on eg biological sequence types.

In principle a minhash sketch is extensible, but I'm only interested in developing the package for biological uses. What's the policy there?

@TransGirlCodes
Member Author

@jgreener64 Thanks! I suggest having this proposal open for a week for discussion, after which agreed on amendments will be made, and the amended proposal will then be on display for another day, before being added to our policy and made official. I will mark suggestions and amendments made on this thread with a 🎉 when they have been added to the proposal above.

@kdm9
Member

kdm9 commented Apr 17, 2017

@kescobo I'm soon planning on writing a bunch of high-level k-mer counting stuff in Julia. Are you interested in pulling out the minhash-based "counter" into a package for kmer analysis? Thoughts, all? Would that be too modular?

@kescobo
Member

kescobo commented Apr 18, 2017

@kdmurray91 Not sure. Frankly, I'm not 100% clear on where its dependencies will lie once the split occurs. Based on this question I asked and this response from @Ward9250, it seems like my current implementation would be part of BioSequences.jl, but presumably your kmer counting stuff would also have that as a dependency (that's where kmers are going, right?), so in principle it could work.

Happy to discuss further and help assuming the rest of the org agrees. Might make sense to break out into a separate issue?

@TransGirlCodes
Member Author

@kescobo, @kdmurray91 Kmers currently have an implementation in BioSequence, so really to know about organisation we should have a more solid idea of what is currently in BioSequence, what is missing, what will be added, and whether or not this requires a separate package.

@kdm9
Member

kdm9 commented Apr 19, 2017

@kescobo I'm not 100% sure either. I was intending for such a package to contain a bunch of high-level kmer counting routines, using CountMin sketches, Minhash sketches, bloom filters, dicts and dense arrays (like composition()). Also routines for counting transitions between kmers (treating sequences as discrete Markov processes). The actual kmer iteration machinery would stay in BioSequences, and this package would depend upon it.

My motivation is that I want to make a package that contains the above, as I feel that all of the above would be a lot to include in BioSequences.jl.
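
For readers unfamiliar with the simplest of the approaches listed above, here is a minimal sketch of dict-based k-mer counting. It assumes a plain ASCII `String` stands in for a real sequence type, and `count_kmers` is a hypothetical name, not a function from BioSequences.jl or any planned package.

```julia
# Count every overlapping k-mer in a sequence with a Dict.
# Assumption: a plain ASCII String stands in for a real sequence type.
function count_kmers(seq::String, k::Integer)
    k > 0 || throw(ArgumentError("k must be positive"))
    counts = Dict{String,Int}()
    # The range is empty when the sequence is shorter than k.
    for i in 1:(length(seq) - k + 1)
        kmer = seq[i:i+k-1]
        counts[kmer] = get(counts, kmer, 0) + 1
    end
    return counts
end

counts = count_kmers("GATTACATTA", 3)
```

A dict gives exact counts but its memory grows with the number of distinct k-mers, which is the usual motivation for the sketch-based and bloom-filter alternatives mentioned in the same list.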

@bicycle1885
Member

I agree with @kdmurray91 to separate high-level kmer functions into a separate package if they don't depend on specific data structures of sequences. So, keeping the kmer generator in BioSequences.jl would be sensible.

@TransGirlCodes
Member Author

TransGirlCodes commented Apr 19, 2017

@kdmurray91 Kmers in BioSequence currently only inherit from the abstract type Sequence and require the Biosymbol and Alphabet parameters.
And so the separate Kmer package you have in mind, may as well also define the Kmer types too, whilst importing just those two deps from BioSymbols and BioSequences. (Kmers.jl?? :D )

In fact it might be good to make that the pilot to try the less centralised maintainership of packages previously discussed (reminder quoted below) (which I am currently writing for the contribution and community policy site). @kdmurray91 If you are up for that, you would be listed as the designated maintainer of that package and added to the Maintainer's Team.

As BioJulia packages become more numerous, BioJulia members who want to could be deputised as maintainers of specific packages by the admin. One of the duties of the maintainer is then to choose and uphold a branching model for the repo they maintain. Very popular packages could have a few maintainers to ensure coverage of activity. The maintainers have higher level write access to the modules they are maintaining, but not those they do not maintain.

We should probably talk about modularising maintainership in the BioJulia community now anyway: I can especially see this happening in a future where people want to contribute packages and software they've worked on and want to publish to BioJulia, but they also want to remain as maintainers of said package. Furthermore, as the codebase becomes more modular with more and more packages, it will be more difficult for a small group of admins to keep up, and so a more modular maintainership, in which there is a larger group of maintainers responsible for individual packages, seems sensible: members who want to help to maintain BioJulia can then focus on maintaining packages more related to their field and expertise. The smaller group of admins are then there in a guiding role for such maintainers, doing things like helping resolve issues, enforcing the community code of conduct, and keeping things running smoothly.

@kescobo
Member

kescobo commented Apr 19, 2017

@kdmurray91 that makes sense to me - let's do it!

@kdm9
Member

kdm9 commented Apr 20, 2017

@Ward9250 I'm up for that. I'm not 100% sure that the low level kmer code needs to be in the new package, but let's go with what you said above.

Regarding maintainership, I'm more than happy to, with a couple of caveats: I'm not that experienced with optimising Julia code for performance, and would love to have code I write reviewed by someone a bit more experienced. I'd also like to advocate the "lowNMU" concept from Debian, which in essence gives other maintainers of BioJulia packages freedom to step in as maintainer if they deem it required, e.g. if I have a PhD-induced period of absence. This is a little contrary to the statement that:

The maintainers have higher level write access to the modules they are maintaining, but not those they do not maintain.

@kescobo I think it would make sense to wait till after BioSequences is stabilised, but we can start on the sketching data structures. It might even be worth keeping those generic, and having a separate SketchingDataStructures.jl package. Thoughts?

@kescobo
Member

kescobo commented Apr 20, 2017

@kdmurray91 My instinct is to push against excessive splitting - seems like we can try to build the data structures as generically as possible, and then split them off at a later date if there's a compelling reason to. That said, I don't feel strongly about it. Happy to start working on that in the near-ish future, though can't devote more than a few hours a week until the summer when I'll (likely) be starting another postdoc.

At this stage in our community, I think a low threshold for stepping in makes a lot of sense, though probably not worth formalizing. Each package could in principle have their own level of maintainer-control, though we should think hard about setting default values. I think @Ward9250 's suggestion makes some sense as a baseline, though probably only if there are at least two maintainers for any given package, in case one goes incommunicado.

@TransGirlCodes
Member Author

TransGirlCodes commented Apr 20, 2017

I think @kdmurray91's suggestion is a good one. So I propose changing

The maintainers have higher level write access to the modules they are maintaining, but not those they do not maintain.

To the following, something like this will go up on the Contributing docs:

Packages have at least one "Dedicated Maintainer" who has admin access to the package.
Typically this will be the original contributor of the package, but popular or larger packages may have multiple such dedicated maintainers, and maintainers may join or leave (see onboarding and offboarding - on the site we'll have a link).
The dedicated maintainer(s) will be responsible for deciding the branching model used and how branches are protected, in addition to reviewing PRs and resolving issues for that package.
As a Dedicated Maintainer named on the package README and BioJulia website, we ask them to expect to often be the first contact for new contributors, community members and maintainers.

All maintainers in the Maintainers team have push access to all code packages in the BioJulia ecosystem.
This allows for a community spirit where maintainers who are dedicated primarily to other packages may step in to help other maintainers to resolve a PR or issue.
As such newer maintainers and researchers contributing a package to the BioJulia ecosystem can rest assured help will always be at hand.
However, if you are a maintainer stepping in to help the Dedicated Maintainer of another package, please respect them by: first, offering to step in and help resolve something, and secondly, asking the Dedicated Maintainer before doing advanced and potentially destructive git operations, e.g. force-pushing to branches (especially master) or rewriting the history of branches.

@kdm9
Member

kdm9 commented Apr 20, 2017

@Ward9250 Perfect!

@kescobo I think that sounds like a good start, like you say we can split it off later if it seems useful to others (which I assume it would be). I have CountMin.jl that will be rolled into this package, and your minhash sketch will go in. There are some issues with BloomFilters.jl, so I was thinking that we could fork that code. I'll continue this train of thought in an issue @ kmers.jl.

@bicycle1885
Member

@Ward9250 I think it was too early to tag and release BioCore.jl v1.0. We no longer use Ragel so the BioCore.Ragel module should have been renamed and moved to BioCore.ReaderHelper or somewhere. Also, BioCore.StringFields may be no longer needed under the current design of text parsers. We really need to be careful when releasing a version 1.0.

@TransGirlCodes
Member Author

TransGirlCodes commented May 25, 2017

@bicycle1885 We can do that in a fixup or minor release. I don't think it's that radical that version 1 matches the contents of Bio.jl. I'm doing a release of BioCore and a release (even if it's not 1.0) of BioSequences this week as some drastic circumstances have come up at work, and there are things which are going to require them. Long story short, I'm being offered a position which means timelines for my current job have become a nightmare.

EDIT:

Long story short, I'm being offered a position which means timelines for my current job have become a nightmare.

This doesn't mean I'm going anywhere or that my work with BioJulia is stopping, but my PI and academic focuses will be different.

@TransGirlCodes
Member Author

Just to update people on progress. BioSequences.jl has been released, providing the features of Bio.Seq in our decentralised software packages. @bicycle1885 is currently working on GenomicIntervals.jl and I'm going to start work on GeneticVariation.jl this week.

@bicycle1885
Member

Let me clarify my feelings:

  • Since Julia 0.6 will be released soon (this month or next, I hope), it would be better to support 0.6.
  • However, since Julia 0.5 will be used in many places for a while, I have started to think it is not a good idea to drop support for Julia 0.5 now.
  • So, gradually decomposing Bio.jl into packages, as @Ward9250 and I are doing, and starting to support Julia 0.6 one by one would be easier and quicker.
  • The last monolithic Bio.jl release may not be needed.
  • We will finally release a new Bio.jl after the modular packages are completed.

@TransGirlCodes
Member Author

For me, pushing through the gradual release was necessary because of my non-BioJulia work commitments, and it seemed to be the most manageable approach - most contributors we've had are volunteers, and the gradual decomposition is easier to find time for. I think all of those points are right and we should go with them.

@TransGirlCodes
Member Author

Is there an ETA on 0.6 btw? It seems I see everywhere it's "soon" but I thought it would be released by now - not that I'm complaining, the developers do a fantastic job.

@bicycle1885
Member

bicycle1885 commented Jun 14, 2017

No ETA as far as I know. I've been saying "soon" for three months, but it is not out yet. The latest release is Julia 0.6-rc3, and they occasionally release three or four release candidates before the final one. So, I believe this "soon" is really soon.

@jgreener64
Member

How close to completion is the decomposition? It seems most of the stuff is out in separate packages. Now that v0.6 is out and about it would be good to get it finalised - let me know if there is anything I can do.

@TransGirlCodes
Member Author

TransGirlCodes commented Sep 23, 2017

Hey @jgreener64, we are very close now. There's been a lull recently as I've had to go through a job interview for the assembly and algorithms development team here in Norwich, which I'm glad to say I got, so anything cool we make will end up in BioJulia in one form or another, as I will still be allowed to do BioJulia stuff. Now that the prep stuff has gone away and I am working solidly on coding again, I can get back to finishing this process.

@jgreener64
Member

Great, thanks for the update @Ward9250 and well done.
