Scaling Maintenance #49

mrocklin · 2020-05-05T18:36:58Z

Over the past six months the amount of interest in contributing to Dask has grown considerably. We should think about how best to harness and organize this work to improve the project and meet community needs.

Historically

Historically Dask maintenance was handled by a smallish group of around four people, most of whom were under the same employer (Continuum/Anaconda) and so had a shared management chain. This group was small enough that everyone knew what was going on, and product/project management was the work of an individual (mostly me at the time), which made sense since we were all in the same corporate structure.

Today

Today we have something like five-to-ten organizations showing up. We've tried a loose anarchic product/project management structure for a while in the form of

A rotating facilitator every week (lovingly called czar)
A "what does everyone want to work on" assignment policy
With occasional "it would be great if someone could help over here" peer-management (which is great to see)

Issues (from my perspective)

This approach is nice because it democratizes management in a friendly way. No one is "in charge", which reflects the employment reality. It feels community-centric, which feels good to me.

However there are, I think, a few issues:

Because individuals don't track the entire project they don't have a good sense of what is important to work on. As a result, people usually self-select issues that seem easy to them. I get the sense that people are happy working on harder problems if they're more important; they just don't have this information.
There is no onboarding process for new engineers who join the group.
There is no specialization of labor. Some tasks require in-depth Dask expertise. Some require finicky programming and problem solving skills. Some require emotional sensitivity and communication skills.
Handling larger or harder projects is rare. This is appropriate given that our original goal was to handle the onslaught of issues, but maybe now we have enough dev-power to handle some larger issues.
Maintenance issues still get left behind. We're inconsistent about handling old issues and PRs.

I get the sense that with more structure we could get a lot more done, which would be exciting, and might grow this team even further.

Some options

Two extremes in project management might be the following:

The current model, which I'll call "anarchy"
The original model from years ago, which I'll call "dictatorship", where someone with full view over the project tells people what to do. Historically this was me, but today could also be someone like @TomAugspurger or @jrbourbeau who both seem to track all of the issue trackers and have a good holistic view of what's going on in the project.

Personally I don't think that either anarchy or dictatorship is appropriate today. I think that something in between probably makes sense. For example, I'll propose "federation" (I've been watching Star Trek recently) where we might split off into teams, each with its own czar. These teams might assemble for something like a month at a time and have clear direction from a single individual. They might meet weekly (or whatever makes sense for them) and then at the end of the month report out work done in a blogpost (the czar's responsibility) and at the monthly meeting.

One of these teams would always be issue/PR triage, and in some cases the triage team might ask another team to roll something into their work. I suspect that a team of 4-5 people is about right for this task, especially if there is a single Maintenance Czar for that month who is comfortable tracking everything and managing work.

The topics for other teams might come out of a monthly planning issue, followed by the monthly meeting. Here are some example topics that might serve as a good month-long project for a small team.

High level query optimization
Array performance with task fusion, as applied to both Xarray problems and optimization problems
Plan, deploy, and analyze a user survey
Generalize the Dask-XGBoost pattern so that it can be more effectively used by others
Profile and reduce scheduler overhead
Improve Dask + Tensorflow or Dask + PyTorch integration. This could involve surveying users, fixing bugs, and maybe making a small package
Dask + Timeseries. Survey users, fix bugs, maybe make some small package
Outreach. Give a tutorial and have office hours every week over some web conference to the general public
Stability. The team tries to break the Dask scheduler, and then tries to fix whatever it broke
Work with the Napari community to make sure that their system is as smooth as possible

Teams

Regardless of what we choose, there are two parts of the plan above that I like:

Teams: we're at a point where having 10 people in a weekly call probably doesn't make sense. I think that breaking people into smaller teams is probably better for organization, and helps to develop camaraderie. I think it's probably also good for onboarding if we are intentional about mixing newer and older maintainers, and about mixing people from different companies.
Monthly rather than weekly cycles. I think that switching czar on a weekly basis means that things get dropped, and that people tend not to develop a sense for what's going on broadly within the project.

Corporate Engagement

Some of the topics above correspond to pain points for certain companies. As two examples:

Stability. The team tries to break the Dask scheduler, and then tries to fix whatever it broke. This is important for many, but for Blue Yonder in particular
Profile and reduce scheduler overhead. This is important for many, but for NVIDIA and Pangeo in particular

We might invite corporate involvement, where companies dedicate more resources than is typical, but we mix maintainers into the effort to act as spirit guides and make sure that the community side is smooth.

jcrist · 2020-05-12T15:24:55Z

An unorganized jumbling of thoughts:

I agree that given the current resources we have available for dask maintenance, we should try a new model (or maybe several new models) of organizing to see if we can be more efficient.
I agree wholly with the 5 issues you listed (under "Issues (from my perspective)").
The dask ecosystem (much like the Jupyter ecosystem) has gotten too big for any one person to know all of it deeply. Having a general czar (or czars) seems like the wrong model at this point. Any one person will have blind spots in at least some of the projects, so triaging and reviewing issues/PRs for those projects will be tricky. Specialization and subteams seem like the natural solution to this. As a project, distributed is sufficiently separated from dask that a dask dev may never need to think about distributed's internals (and vice-versa).
We don't have a good way for onboarding new devs. Having an explicit team of people who manage the community issues/PRs, and are there to mentor new devs through issues (and suggest good issues) would be a best-case scenario in my mind. In a perfect world I might commit 3-4 people to 8 hours a week of working on bugs, triaging issues, reviewing PRs, and mentoring new devs through work. I personally find PR reviewing time to be bimodal - some PRs are very quick to review, and others take 20-30 min. Coupled with handling new issues, and working asynchronously with others, this is a full-time job. I personally spent 6 hours yesterday solely reviewing PRs and issues.
At the same time, we'll want to organize some collective large projects, probably managed by smaller teams. The projects you listed seem like good examples of the kind of work we may be able to accomplish.

A few concrete suggestions, just for discussion (these may just be restating much of what you said above, apologies):

Maintenance Team

We allocate a team of people to handle "maintenance" (prioritizing in order: reviewing user PRs, responding to issues, mentoring new devs, and fixing core bugs (triaged by this team alone, not the larger collective)) for a longer period of time. Instead of having a team swap in and out, I suggest we allocate people on a cycle that offsets with others, so any week will have some people from the previous week, and some new people. If we had a team of 4, we might have one new person rotate in and one rotate out each week. We'll want some level of specialization on this team, so no one person is responsible for watching all repos.
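A minimal sketch of the staggered rotation described above, assuming a fixed, ordered roster of volunteers. The function name `maintenance_team` and the placeholder names in `roster` are hypothetical and only illustrate the idea that each week one person leaves and one joins.

```python
def maintenance_team(roster, week, size=4):
    """Return the people on maintenance duty in a given week (0-indexed).

    The team is a sliding window over the roster, so consecutive weeks
    share size - 1 members: one person rotates out and one rotates in.
    """
    if size > len(roster):
        raise ValueError("team size cannot exceed roster length")
    n = len(roster)
    return [roster[(week + i) % n] for i in range(size)]


# Hypothetical roster; the names are placeholders, not real assignments.
roster = ["dev_a", "dev_b", "dev_c", "dev_d", "dev_e", "dev_f"]

for week in range(3):
    print(week, maintenance_team(roster, week))
```

With a roster of six and a team of four, each person serves four consecutive weeks and then has two weeks off before the cycle returns to them.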

Working Groups

We also split into smaller "working groups". We might decide what's critical on a bi-weekly cycle, and then split up to work in teams between bi-weekly meetings. People in the maintenance team might have opinions on any 2-week chunk of work, and should be welcome to comment, but they shouldn't be expected to do any of the work while they're on maintenance duty.

On-boarding

New devs interested in helping out with core maintenance should start with a stint on the maintenance team. We'll want to ensure we have 1-2 experienced devs handling maintenance at any one time. They'll help mentor the new devs through work, reviewing PRs, and suggesting issues to work on.

Meetings

We'll naturally want to change our meeting schedule to adjust for the new arrangement. Given the above, I'd suggest:

The maintenance team meets twice a week for short (< 15 min) stand-up style meetings. This is to help organize the maintenance cycle and discuss issues.
The working groups self organize as needed
Larger group meeting every 2 weeks, where everyone reports what they've been working on, perhaps new larger tasks are allocated, and larger group issues are discussed.
One monthly meeting where the larger community is welcome to participate. This is the same as the existing monthly meeting.

martindurant · 2020-05-13T13:27:38Z

I would like to propose a parquet/filesystems working group, especially given the recent developments within arrow. We should meet to create a common approach and requirements, and decide how to do any work at our end. I nominate the following people (but others are welcome to raise their hands): @rjzamora (who has been most active in parquet in recent times), @jcrist (who delved into arrow for ORC and other things), @TomAugspurger (who has maintained file-systems things) and myself.

mrocklin · 2020-05-13T14:16:35Z

We met yesterday and had a conversation about this topic. The result was that for the next couple weeks we're going to try splitting into groups in the following way:

Issue triage and bug-fixing: @jcrist, @martindurant, @jsignell, Dan Kerrigan, and @jacobtomlinson (half time)
High level query optimization for dataframes @TomAugspurger @gforsyth
Survey other communities for what they do and publish a blogpost about it @jjhelmus
Floating @quasiben @mrocklin

We'll meet next week to report in, but assuming that things are going ok we'll keep on this track for another week or two afterwards.

jacobtomlinson · 2020-05-13T15:00:48Z

TL;DR We should use labels to help track everything better. I'm happy to set this up.

I would also like to propose that we lean more heavily on GitHub as a tool to try and facilitate better knowledge transfer of the current state of the world.

A good start would be making better use of issue labels. Looking through the various trackers now it looks like most do not have any consistent use of labels.

Mature projects with many active contributors, such as Kubernetes, have a label system that identifies various categories of information: the kind of issue it is, what skills are required to handle it, which working group is responsible for resolving it, and how important it is.

This means contributors can quickly find issues which are within their skill set to solve, and maintainers can track important issues which are falling through the cracks.

Therefore I have the following proposal:

We introduce consistent issue templates across all repos which allow users to specify some labels themselves (bug, feature request, etc). All templates would also include a needs-triage label.
The Czar team is responsible for labelling issues correctly and removing the needs-triage label.
The Czar team is also responsible for actively fixing or delegating high priority issues.
All other developers can use the labels to find things to work on.
The weekly/fortnightly meeting would be an appropriate place to discuss increasing the priority of an issue. But the Czar team would also be responsible for reassessing older issues continuously.

This also fits with the idea of having working groups/teams who care about a single topic (e.g. filesystems) and also having a czar group who care about high priority items from all topics and about maintaining the state of the world. Cutting across in both directions should help avoid things slipping through the cracks.
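The labelling workflow in the proposal could be sketched roughly as follows. This is an illustration only: it models issues as plain sets of label strings rather than calling any GitHub API, the label names are the examples floated in this thread, and the helper names `triage` and `untriaged` are hypothetical.

```python
# Label groups are assumptions based on examples mentioned in this thread.
TRIAGE_LABEL = "needs-triage"
KIND_LABELS = {"bug", "feature"}
TOPIC_LABELS = {"array"}
SKILL_LABELS = {"good-first-issue"}


def triage(labels, kind, topic=None, skill=None):
    """Return a new label set with needs-triage removed and categories added."""
    if kind not in KIND_LABELS:
        raise ValueError(f"unknown kind label: {kind}")
    if topic is not None and topic not in TOPIC_LABELS:
        raise ValueError(f"unknown topic label: {topic}")
    if skill is not None and skill not in SKILL_LABELS:
        raise ValueError(f"unknown skill label: {skill}")
    new_labels = set(labels) - {TRIAGE_LABEL}
    new_labels.add(kind)
    if topic is not None:
        new_labels.add(topic)
    if skill is not None:
        new_labels.add(skill)
    return new_labels


def untriaged(issues):
    """Return the numbers of issues that still carry the needs-triage label."""
    return sorted(n for n, labels in issues.items() if TRIAGE_LABEL in labels)
```

Here `untriaged` stands in for the query the czar team would run to find issues that have not yet been looked at, and `triage` for the act of categorizing one of them.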

jcrist · 2020-05-13T15:23:41Z

I think that sounds like a good thing to try. I think that kubernetes goes a little overboard with labels, so I'd suggest we start with a small set. Perhaps a few topical labels (e.g. array), a few triaging labels (needs-triage, bug, feature, ...), and a few skill labels (good-first-issue, ...).

mrocklin · 2020-05-13T15:28:06Z

I like the idea of automating `needs-triage` and also maybe `stale` if that's easy to do. That would help avoid the situation where issues/PRs get lost in handoff.

jacobtomlinson · 2020-05-13T15:30:44Z

> I think that kubernetes goes a little overboard with labels

Agreed

> Perhaps a few topical labels (e.g. array), a few triaging labels (needs-triage, bug, feature, ...), and a few skill labels (good-first-issue, ...).

Yeah those are good suggestions. I've raised #50 to track this without creating too much of a tangent here.

martindurant · 2020-05-13T15:42:53Z

I would tentatively add "watching stack overflow" to the list of maintainer duties - which is something that has only been done sporadically by the team. As it happens, the list of unanswered questions there is rather large right now. A typical question is probably lower priority than a github issue (often "how do I", as opposed to "this is broken") and the chances of a post being complete or useful are lower; but still, this is one public face of dask.

mrocklin · 2020-05-13T15:54:12Z

@quasiben you were unassigned this week. Want to take a tour of the stackoverflow Dask tag?

dhirschfeld · 2020-05-14T00:50:50Z

Might be a useful reference, at least to get some ideas:
https://robinpowered.com/blog/best-practice-system-for-organizing-and-tagging-github-issues/

An Architecture or Design label might be useful to tag issues which touch on the internals and might require the deep expertise of several maintainers.

quasiben · 2020-05-14T01:03:40Z

> @quasiben you were unassigned this week. Want to take a tour of the stackoverflow Dask tag?

Sounds good

jsignell · 2020-05-14T14:36:11Z

I'm wondering if it might be useful to have a github team that people rotate on and off of. It might help with the issue of wanting to ping the czar but not knowing who it is.

martindurant · 2020-05-14T14:38:11Z

^ yes, please. I, for instance, don't know how our current maintainer team should be working; nor did anyone reply to my suggestion of a parquet/fs group.

jacobtomlinson · 2020-05-14T14:42:57Z

> don't know how our current maintainer team should be working

I think you missed the call on this yesterday. Happy to catch you up, let me know.

> nor did anyone reply to my suggestion of a parquet/fs group.

We should definitely have some working groups as @mrocklin initially suggested. A filesystems one would be a good one to have.

martindurant · 2020-05-14T14:45:14Z

> Happy to catch you up, let me know.

yes, please!

jacobtomlinson · 2020-05-14T14:47:47Z

Ok I'll drop into whereby.com/dask-dev at the top of the hour for 10 mins for a catch up.

rjzamora · 2020-05-14T17:59:05Z

@martindurant - Happy to be involved in a parquet/filesystems working group (sorry for the delay)

TomAugspurger · 2020-05-14T18:23:23Z

I also think that forming working groups for specific topics is a good idea. Happy to help with the parquet / filesystems one.

martindurant · 2020-05-14T18:25:54Z

Called "IO" for now https://github.com/orgs/dask/teams/io/members

dhirschfeld mentioned this issue May 9, 2020

Tasks are not stolen if worker runs out of memory dask/distributed#3761

Open

jacobtomlinson mentioned this issue May 13, 2020

Add issue labels #50

Closed

jsignell mentioned this issue May 14, 2020

GitHub @dask/maintenance team #53

Closed
