Week of March 3rd. In Person. University of Illinois, Champaign.
There will likely be 1-2 opportunities for Zoom sessions preceding and following the core workshop. They will be listed here if so.
Contact Matt Yoder or Debbie Paul with questions.
Participation is largely set, with logistics being worked out now. If you are interested in learning more, reach out to Matt or Debbie. SFG is contributing funding to the logistics of the meeting, though most has been accounted for. Larger follow-ups, facilitated by SFG and perhaps NSF workshop money, are of interest; again, reach out.
At the conclusion of the workshop we hope to have a hybrid session to communicate with our broader communities. This need not be based on the workshop activities. Details remain to be worked out, but we anticipate a moderated round-table discussion and/or 15-20 minute presentations, not to exceed an hour, with time for conversation in between. Max 2 well-padded hours.
Driven by a long-running set of use-cases and requirements largely pertaining to TaxonWorks-related requests from Donald Hobern, and a need to firmly set aside time to address these, we blocked out time to focus on the issues. Given ongoing developments at the SFG, and a need to align with other global efforts, both technical and cross-domain (e.g. plants, animals), we've expanded the scope slightly.
While we have some specific agendas, we will come to the hackathon with an open mindset and evolve the work according to unconference-style discussion on the first day.
There are some well-known contexts behind the efforts; the primary focus is their technical aspects:
- "Sharing" taxonomies - People need to effectively draw data into their work-spaces.
- "Under-utilized" APIs - All the major players now share data in straightforwardly accessible APIs, and all players can quickly adapt those APIs to emerging needs. The utility and potential behind this fact remain, at least in Matt's opinion, greatly underutilized at all levels of use (from individual research to global initiatives).
- "Fractal" interactions - There is a spectrum of work, from reconciling a single taxon name, to creating a globally unique nomenclator. Describing this spectrum and annotating the paths across it with real-life use-cases is necessary for us to collectively architect workflows, software, APIs, and UIs that facilitate this work. See Exchanging Names below.
- "Merge and sync" - Given a set of names A and a set of names B, we want a multi-step, human-in-the-loop process that a) minimizes lossiness; b) minimizes dangerous human decisions; c) maximizes the impact of a human's decisions; d) integrates or clearly ties into a DSL (domain-specific language) that facilitates the work.
- "Checklists as macros" - When we compare two rows of data in A and B, we should come up with a deterministic set of steps that results in their "syncing" or "merging". These need not result in a perfect merge or sync; they must, however, result in an understood outcome at some stage/step.
- "Latency exists, and that is OK" - We are not banks or trading platforms requiring sub-millisecond latency. For example, when a taxonomist updates a name in TaxonWorks it does not need to be globally shared immediately. Accepting that latency scales with the size of the dataset (we've spent 250 years or more building the global dataset, and we're not done), and documenting and explaining this, removes a lot of pressure from engineering requirements. For another example, is it really necessary to do "Monthly" updates to the CoL? Carefully defining and communicating how often resources are updated lets tools/users comfortably work around these schedules. Coming up with real-world examples of latency impacting biodiversity informatics would almost certainly become a very useful reference. Existing related examples (time from collection to species description) have been published and are increasingly well understood. Can we predict upper and lower bounds per process/product?
- "Using the right names" - Curators, scientists, etc. claim they want to use the correct names, and the phrase "automatically update" typically comes up shortly after their scenarios are laid out. Matt strongly believes this simplification is not at all what they want, and that this becomes evident once scenarios are played out in detail. What, then, do they really want, and how can we better communicate and point this out?
- "Using a backbone" -> See "Using the right names".
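As a starting point for discussion, the "Merge and sync" and "Checklists as macros" ideas could be sketched roughly as below. This is a minimal illustration only, not a TaxonWorks or ChecklistBank feature: `NameRow`, `Outcome`, `classify`, and `plan_merge` are hypothetical names, and the rules (match on a whitespace- and case-normalized name, send authorship conflicts to a human) are assumptions chosen to show that every row pair ends in an understood outcome.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable, Optional


class Outcome(Enum):
    """Hypothetical, closed set of outcomes for one compared row pair."""
    IDENTICAL = "identical"        # rows agree; nothing to do
    ONLY_IN_A = "only_in_a"        # candidate to add to B
    ONLY_IN_B = "only_in_b"        # B has it, A does not; left untouched
    UPDATE = "update"              # same name and author, status differs
    NEEDS_REVIEW = "needs_review"  # authorship conflict; a human decides


@dataclass(frozen=True)
class NameRow:
    name: str         # e.g. "Aus bus"
    author_year: str  # e.g. "Smith 1920"
    status: str       # e.g. "valid" or "invalid"


def normalize(name: str) -> str:
    """Collapse whitespace and case so trivially different strings match."""
    return " ".join(name.split()).casefold()


def classify(a_row: Optional[NameRow], b_row: Optional[NameRow]) -> Outcome:
    """Deterministic steps: the same pair of rows always yields the same outcome."""
    if a_row is None:
        return Outcome.ONLY_IN_B
    if b_row is None:
        return Outcome.ONLY_IN_A
    if a_row.author_year != b_row.author_year:
        return Outcome.NEEDS_REVIEW
    if a_row.status != b_row.status:
        return Outcome.UPDATE
    return Outcome.IDENTICAL


def plan_merge(a: Iterable[NameRow], b: Iterable[NameRow]) -> Dict[str, Outcome]:
    """Build a reviewable plan; nothing is written to either set.

    Assumption: normalized names are unique within each set, so homonyms
    would need a richer key than this sketch uses.
    """
    index_a = {normalize(row.name): row for row in a}
    index_b = {normalize(row.name): row for row in b}
    keys = sorted(index_a.keys() | index_b.keys())
    return {key: classify(index_a.get(key), index_b.get(key)) for key in keys}
```

Keeping the plan separate from any write step is one way to keep the human in the loop: only the `NEEDS_REVIEW` rows require a decision, and each decision could later be recorded as a reusable rule.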
In general the challenge is now not API access, it's moving from "native" or "standard" to "native" or "standard". For example:
- ChecklistBank to TW native
- TW native to WoRMS
- TW native to ChecklistBank
- ChecklistBank to GBIF Backbone
- Catalogue of Life (=ChecklistBank?) to TW
- ChecklistBank to Specify/Arctos/Symbiota/ALA/custom
- Rhakhis to ChecklistBank
- ChecklistBank to DwC Checklist
This can also be abstracted on the axis of DSLs, e.g. a name in native Go to R to Ruby to Python to Java to PHP to Javascript etc.
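To make one "native to standard" hop from the list above concrete, here is a rough sketch of mapping a native record to a Darwin Core checklist row. The column names are real Darwin Core terms, but the native record keys (`id`, `parent_id`, `name`, `author_year`, `rank`) are invented for illustration and are not the TaxonWorks data model.

```python
import csv
import io
from typing import Iterable

# Darwin Core terms used as CSV columns.
DWC_FIELDS = [
    "taxonID",
    "parentNameUsageID",
    "scientificName",
    "scientificNameAuthorship",
    "taxonRank",
]


def to_dwc_row(native: dict) -> dict:
    """Map one hypothetical native record to Darwin Core terms."""
    author = native.get("author_year", "")
    parent_id = native.get("parent_id")
    return {
        "taxonID": str(native["id"]),
        "parentNameUsageID": "" if parent_id is None else str(parent_id),
        # Darwin Core's scientificName carries authorship when it is known.
        "scientificName": f"{native['name']} {author}".strip(),
        "scientificNameAuthorship": author,
        "taxonRank": native.get("rank", ""),
    }


def to_dwc_csv(records: Iterable[dict]) -> str:
    """Serialize native records as a CSV string with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=DWC_FIELDS, lineterminator="\n")
    writer.writeheader()
    for record in records:
        writer.writerow(to_dwc_row(record))
    return buffer.getvalue()
```

A real exporter would also have to decide how native validity maps to a Darwin Core taxonomic status, which is exactly the kind of lossy step worth documenting per hop.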
For example "Aus bus Smith 1920" attached to "root" of the native target.
E.g. a species name, and its genus and higher taxonomy.
Here we assume the name is missing, but the classification may be:
- Completely missing
- Partially missing
- Completely present
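The three classification states above can be detected mechanically before anything is created. The sketch below assumes the incoming lineage is an ordered list of ancestor names (root first) and that the target can be queried as a simple set of existing names; both are simplifications, and the function names are hypothetical.

```python
from typing import List, Set


def classification_state(lineage: List[str], existing: Set[str]) -> str:
    """Report how much of the incoming classification the target already has."""
    present = [name for name in lineage if name in existing]
    if not present:
        return "completely missing"
    if len(present) == len(lineage):
        return "completely present"
    return "partially missing"


def attachment_point(lineage: List[str], existing: Set[str], root: str = "root") -> str:
    """Find the nearest ancestor already in the target, else fall back to root."""
    for name in reversed(lineage):  # walk from the immediate parent upward
        if name in existing:
            return name
    return root
```

For "Aus bus Smith 1920" with lineage `["Animalia", "Insecta", "Aidae", "Aus"]` and a target that only holds `Animalia` and `Insecta`, the state is "partially missing" and the new branch would attach under `Insecta`.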
E.g. a Genus and its species.
While the simplest use case is that everything is missing, we can perhaps assume almost nothing:
- The name may be present/absent
- The children may be present/absent.
If there is some synchronization, then we have the "Many names" case.
A matrix of possible use-cases exists based on the following axes:
- The set of names is all valid, valid and invalid, or all invalid
- The attachment point is empty, or not (i.e. create or update, aka add or sync)
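Enumerating the matrix explicitly gives a checklist of cases to cover with real examples. A small sketch, where the axis labels are taken from the list above and the function name is hypothetical:

```python
from itertools import product
from typing import List, Tuple

VALIDITY = ("all valid", "valid and invalid", "all invalid")
ATTACHMENT = ("empty (create/add)", "not empty (update/sync)")


def use_case_matrix() -> List[Tuple[str, str]]:
    """Return every combination of the two axes, one tuple per use-case cell."""
    return list(product(VALIDITY, ATTACHMENT))


if __name__ == "__main__":
    for validity, attachment in use_case_matrix():
        print(f"names: {validity:<18} attachment point: {attachment}")
```

Two axes give six cells; adding a further axis later (for example, how many names are exchanged) is a one-line change to the `product` call.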
Global checklists, nomenclators
Here we are thinking both of web-accessible `/api/` entry points (REST, JSON, etc.) and of code-based interfaces, e.g. ORMs used in things like TaxonWorks.
TODO: brief links to documentation.
Should likely move these to Issues for discussion and clear this out. These are concrete examples with known, requested real-world examples that, if fixed, would remove bottlenecks to existing workflows. "Imagined" use cases are welcome, but should be flagged as such.
- Given ~4k Lepidopteran names that are up-to-date according to the curator's assertion, merge these into an existing, richly annotated set of data that variously overlaps.
- When typing a name in TaxonWorks and it is not found (e.g. async-autocomplete), smoothly prompt the user to draw it down from ChecklistBank (or another API): TW -> ChecklistBank -> TW. Note that we can share data across TW projects by proxy of ChecklistBank, given snapshots are there.
- Request that all children of a name be input into TaxonWorks, in context, seamlessly.
- Resolve Otu names to existing TaxonNames in TaxonWorks or, if there are no matches, to new names.
- Consistent, routine builds of ChecklistBank formatted data for CoL exports, from TaxonWorks.
- When initiating a new project in TaxonWorks I want to pull down a set of completely new names into a project. From time-to-time I want to check to see if there are divergences.
- TODO: Borg examples
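The "name not found, draw it down" use case in the list above has a simple control flow that can be sketched without committing to any particular API. In this illustration the local and remote lookups and the user prompt are passed in as plain callables; none of the names below correspond to real TaxonWorks or ChecklistBank endpoints.

```python
from typing import Callable, Optional, Tuple

Lookup = Callable[[str], Optional[dict]]   # returns a record or None
Confirm = Callable[[dict], bool]           # asks the user to accept a record


def resolve_name(
    query: str, local: Lookup, remote: Lookup, confirm: Confirm
) -> Tuple[str, Optional[dict]]:
    """Try the local project first, then offer a remote candidate to the user."""
    hit = local(query)
    if hit is not None:
        return ("local", hit)
    candidate = remote(query)
    if candidate is None:
        return ("not_found", None)
    if confirm(candidate):
        # A real implementation would create the name locally at this point.
        return ("imported", candidate)
    return ("declined", candidate)
```

For example, with dictionaries standing in for both sources, `resolve_name("Aus cus", local_names.get, remote_names.get, lambda record: True)` returns `("imported", ...)` when the name exists only remotely. Injecting the lookups keeps the flow testable and lets "another API" be swapped in without changing it.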