Change term - MaterialSample #314

Closed
Jegelewicz opened this issue Jan 22, 2021 · 140 comments
Labels
Class - MaterialSample; Controversial (The solution for the issue has not reached a consensus.); Extensions; normative; Task Group - Material Sample (https://www.tdwg.org/community/osr/material-sample/); Term - change

Comments

@Jegelewicz

Jegelewicz commented Jan 22, 2021

Change term

From https://dwc.tdwg.org/terms/#materialsample

MaterialSample
Definition: A physical result of a sampling (or subsampling) event. In biological collections, the material sample is typically collected, and either preserved or destructively processed.
Examples: A whole organism preserved in a collection. A part of an organism isolated for some purpose. A soil sample. A marine microbial sample.

From https://dwc.tdwg.org/terms/#livingspecimen

PreservedSpecimen
Definition: A specimen that has been preserved.
Comments: (none)
Examples: A plant on an herbarium sheet. A cataloged lot of fish in a jar.

Given the above, we propose that MaterialSample should be more specific to something less than what might be considered a "voucher" in order to delineate it from PreservedSpecimen.

  • Proponents (who needs this change): Arctos Working Group

Proposed new attributes of the term:

  • Term name (in lowerCamelCase): MaterialSample (no change)
  • Organized in Class (e.g. Location, Taxon):
  • Definition of the term: A physical result of a subsampling event. In biological collections, the material sample is typically collected as a subsample from a preserved or living organism, and either preserved or destructively processed. In geological and environmental collections the material sample is typically collected as a subsample of a larger geologic or environmental construct.
  • Usage comments (recommendations regarding content, etc.):
  • Examples: A part of an organism isolated for some purpose. A tissue sample. A soil sample. A marine microbial sample.
  • Refines (identifier of the broader term this term refines, if applicable): None
  • Replaces (identifier of the existing term that would be deprecated and replaced by this term, if applicable): http://rs.tdwg.org/dwc/terms/version/MaterialSample-2018-09-06 (added by @tucotuco)
  • ABCD 2.06 (XPATH of the equivalent term in ABCD or EFG, if applicable): DataSets/DataSet/Units/Unit (added by @tucotuco)

Note: all of the above is my interpretation of the Arctos Working Group conversation.

@deepreef

I do not agree with this proposal. I think a better approach is to embrace MaterialSample as currently defined, and instead alter the various "BasisOfRecord" terms that are represented as "pseudo-classes" (my term) in DwC.

I do agree that the definition of MaterialSample does need refinement and clarification (especially with respect to the boundary between an instance of Organism and an instance of MaterialSample [I have thoughts on this]), but I do not agree that the scope of MaterialSample be fundamentally altered to imply "sub"sampled material.

I feel strongly that the DwC class MaterialSample should retain its original definition in the broader sense, to represent the entire spectrum of what we used to refer to as "CollectionObjects" -- that is, inclusive of single whole-organism specimens, derivatives of such (as proposed above), and also aggregates of such (lots, soil/water samples with multiple taxa, rocks with multiple embedded fossils and/or freshly collected and encrusted with [recently] living organisms, etc.)

In summary, I agree with the need and justification for a change in DwC to reconcile these terms, but I think the main change should be in the terms PreservedSpecimen, FossilSpecimen, LivingSpecimen. Instead of representing these as distinct classes that are mutually exclusive with respect to each other and to MaterialSample, I think it makes much more sense to regard these three terms as, in effect, subclasses of MaterialSample (mutually-exclusive alternatives of each other, but all within the scope of MaterialSample), and likewise either deprecate the terms HumanObservation and MachineObservation (as discussed at the recent TDWG, the distinction between them is fuzzy at best), or treat them as subclasses of a general Observation class, which itself is mutually exclusive with respect to MaterialSample.
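
A minimal sketch of the hierarchy described above, written as Python classes purely for illustration. The class names mirror the DwC terms under discussion; `Observation` is the general class suggested in this comment (it is not an existing DwC class), and `is_material` is a hypothetical helper.

```python
class MaterialSample:
    """Broad class: a whole organism, a part of one, or an aggregate (lot, soil, water)."""


class PreservedSpecimen(MaterialSample):
    pass


class FossilSpecimen(MaterialSample):
    pass


class LivingSpecimen(MaterialSample):
    pass


class Observation:
    """Suggested general class, mutually exclusive with MaterialSample (assumption)."""


class HumanObservation(Observation):
    pass


class MachineObservation(Observation):
    pass


def is_material(record) -> bool:
    """True when the record represents a physical object held in a collection."""
    return isinstance(record, MaterialSample)
```

Under this reading, every specimen type answers "yes" to `is_material`, and no observation type does.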

@tucotuco added the labels Class - MaterialSample and Controversial (The solution for the issue has not reached a consensus.) on Apr 16, 2021
@dshorthouse

Agree with @deepreef here. The DINA consortium is in the midst of modelling this and have come to the realization that a catalogued object (= Physical Specimen, Physical Entity) is an instance of a MaterialSample. It may be derived from other instances of MaterialSample (destructively or non-destructively) and may equally produce one or more instances of yet other MaterialSamples as strictly expressed here by @Jegelewicz. And to take this further, Occurrence terms like catalogNumber, otherCatalogNumbers, associatedSequences, and preparations would be better placed under MaterialSample because these have nothing to do with an Occurrence.

@deepreef

@dshorthouse : 100% agreement on all of this. We likewise came to the exact same conclusions (including the move of catalogNumber, otherCatalogNumbers, associatedSequences and preparations from Occurrence to MaterialSample).

Obviously it must be true what they say: Great minds think alike. (Or, perhaps, feeble minds think alike? Probably both, and the challenge is figuring out which this represents...)

Incidentally, I would add to the list disposition - as this seems to be more of a property of the physical specimen than the Occurrence instance at which it was extracted from nature.

Another conundrum is how to apply individualCount. As defined, this is clearly a property of Occurrence, but we need a similar property to track number of "units" (for lack of a better term) comprising an instance of MaterialSample as well. The word "individual" harkens back to the old (now deprecated) individualID, which has been replaced by organismID in the Organism class. But we also have organismQuantity, which seems more specific to "The number of individuals represented present at the time of the Occurrence" ("A number or enumeration value for the quantity of organisms."). So... not sure if we can re-purpose individualCount to be something that applies to instances of MaterialSample in this context; or if we need some other way of tracking the "units" of particular instance of MaterialSample. Perhaps this is best handled via MeasurementOrFact instances? There are two separate issues (here and here) about this going on right now...

So many questions....

@dshorthouse

dshorthouse commented Apr 22, 2021

And yet another conundrum. What is going on with preparations, especially if it were moved into MaterialSample where it probably belongs? Is it a noun, a verb or a gerund? The examples provided could be interpreted as instances of MaterialSample, aggregations of instances, methods employed to produce them, or descriptions of their preservation media or vessel(s). However, the expectation is a singular materialSampleID, which means we should be obliged to make sense of all the relationships among instances of MaterialSample that share a common provenance by using ResourceRelationship. Some of those relationships will be between MaterialSamples and some of those relationships will be between MaterialSamples and Occurrences, the latter comparable in spirit to what we do with basionyms and their relationship(s) to downstream taxon concepts. And, it's from that particular link that we uncover the collecting event details.
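
As a rough illustration of that last point, the sketch below stores ResourceRelationship-style rows and walks from a derived MaterialSample back to its Occurrence. The keys `resourceID`, `relationshipOfResource` and `relatedResourceID` are DwC ResourceRelationship terms; the identifiers, the relationship labels ("derived from", "evidence for") and the `find_occurrence` helper are assumptions made for the example.

```python
# Each dict mimics one ResourceRelationship row. All values are hypothetical.
relationships = [
    {"resourceID": "ms:tissue-1", "relationshipOfResource": "derived from",
     "relatedResourceID": "ms:whole-1"},
    {"resourceID": "ms:skin-1", "relationshipOfResource": "derived from",
     "relatedResourceID": "ms:whole-1"},
    {"resourceID": "ms:whole-1", "relationshipOfResource": "evidence for",
     "relatedResourceID": "occ:1"},
]


def find_occurrence(sample_id, rows):
    """Follow 'derived from' links upward until an 'evidence for' link is found."""
    seen = set()
    current = sample_id
    while current not in seen:  # guard against accidental cycles
        seen.add(current)
        links = {r["relationshipOfResource"]: r["relatedResourceID"]
                 for r in rows if r["resourceID"] == current}
        if "evidence for" in links:
            return links["evidence for"]
        if "derived from" not in links:
            return None
        current = links["derived from"]
    return None


occurrence_id = find_occurrence("ms:tissue-1", relationships)
```

The collecting event details would then be looked up from the Occurrence that the walk returns.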

@Jegelewicz
Author

@dustymc

@dshorthouse

dshorthouse commented Apr 22, 2021

If we extend these realizations to their logical conclusion, we have a problem in how we expect our specimen-based data to be interpreted in the context of an Occurrence. Most (all?) of our aggregators make heavy use of occurrenceID as the canonical anchor for our physical objects that we in the museums community implicitly model as MaterialSample. For us, we're forced to equate occurrenceID and materialSampleID when we share data whereas they are not the same thing. An Occurrence speaks more to an ephemeral, epistemological origin (i.e. basisOfRecord) from which may be derived evidence of past existence manifested as MaterialSample.

The exchange networks of duplicates distributed among herbaria is a concrete example of this. One plant clipped into five pieces, prepared and mounted, each sheet then shipped 'round the world to 5 herbaria. In reality, that's one Occurrence and five MaterialSamples although the participant herbaria have no functional mechanism to produce & share precisely that same progenitor occurrenceID. What they have are their own catalogNumber(s) and vague signals like recordNumber that there was once a unitary Occurrence: one organism at a particular place at a particular time. At present, each herbarium independently creates and attaches a transcribed collecting event (globally plural, locally unique) then shares their data anchored to occurrenceID (globally plural, locally unique) and we solve the problem through yet more abstraction by deploying AI and crafting some clusters with fuzzy edges. But... we're still left with globally plural and locally unique occurrenceIDs for a unitary Occurrence in this example.
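
The herbarium example can be mocked up as follows. This is only a sketch with made-up records: five sheets carry five locally minted occurrenceIDs, and a crude key built from `recordedBy`, `recordNumber` and `eventDate` (all DwC terms) is used to group them. Real clustering is fuzzier than an exact key match; the function name and all values are hypothetical.

```python
from collections import defaultdict

# One plant, five sheets, five herbaria. Every value here is invented.
sheets = [
    {
        "institutionCode": code,
        "catalogNumber": f"{code}-0001",
        "occurrenceID": f"urn:example:{code}:0001",  # locally unique, globally plural
        "recordedBy": "J. Smith",
        "recordNumber": "1234",
        "eventDate": "1998-06-01",
    }
    for code in ("HERB1", "HERB2", "HERB3", "HERB4", "HERB5")
]


def cluster_by_collecting_signal(records):
    """Group published occurrenceIDs that share the same collector, number and date."""
    clusters = defaultdict(list)
    for rec in records:
        key = (rec["recordedBy"], rec["recordNumber"], rec["eventDate"])
        clusters[key].append(rec["occurrenceID"])
    return dict(clusters)


clusters = cluster_by_collecting_signal(sheets)
```

Here the five published occurrenceIDs collapse into a single cluster, which is the unitary Occurrence the paragraph describes.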

@Jegelewicz
Author

the participant herbaria have no functional mechanism to produce & share precisely that same progenitor occurrenceID. What they have are their own catalogNumber(s) and vague signals like recordNumber that there was once an Occurrence.

This is a long-standing problem and not just for herbaria. Mammal occurrences end up at different institutions or collections when skins, skeletons and genetic material get separated over the years.

@Jegelewicz
Author

Jegelewicz commented Apr 22, 2021

See ArctosDB/arctos#1966 for another side of the story

@dustymc

dustymc commented Apr 22, 2021

I believe Arctos has all of the "pigeonholing problems" mentioned in this thread.

https://arctos.database.museum/guid/UAM:ES:4588 seems to meet some definitions of "FossilSpecimen" and PreservedSpecimen, and is also cataloged as https://arctos.database.museum/guid/UAM:Mamm:53942.

Many things in herbaria are "LivingSpecimen" pending a little water and sunlight.

catalogNumber and otherCatalogNumbers seem closer to Occurrence than MaterialSample to me, but we could easily map through one more denormalization. (We do have "MaterialSample otherCatalogNumbers" but I don't think they're exposed via DWC.)

https://arctos.database.museum/guid/MVZ:Egg:10460 is more or less another example of "rocks with multiple embedded fossils."

Observation class, which itself is mutually exclusive with respect to MaterialSample.

We have "there was never a physical part" and "someone says there were physical parts, but they are permanently unavailable for various reasons." I do not see much functional distinction.

So many questions....

Yep!

@dshorthouse

dshorthouse commented Apr 22, 2021

This is a long-standing problem and not just for herbaria. Mammal occurrences end up at different institutions or collections when skins, skeletons and genetic material get separated over the years.

Nit-picky, but by "occurrence" here, you mean MaterialSample or specimen. Occurrences don't go anywhere. There may have been a single Occurrence - a single organism collected in a single event. But, the parts - the MaterialSamples - are now scattered among many homes. They all have a relationship to that original Occurrence (perhaps through a parent MaterialSample that no longer exists eg carved up in the basement of the Smithsonian from a previously documented MaterialSample) but there are barriers to knowing it, agreeing on it, using it, and then sharing it.

@Jegelewicz
Author

Nit-picky, but correct.

@albenson-usgs

So eDNA are MaterialSamples and not Occurrences? Is it both? When is something not an occurrence? Because eDNA have associatedSequences and isn't all of this wrapped up in the occurrence core anyway? So what does it really mean practically for a term to be "placed under MaterialSample"?

Personally I think a larger community discussion needs to happen around basisOfRecord and what it's intended to convey. I field a lot of questions in the OBIS and GBIF US communities about this term because it's required and has a controlled vocabulary so data providers and managers have to apply it and it isn't really clear how a downstream user will interpret it.

For the Machine Observations TDWG group, especially for biologging data we are using basisOfRecord to distinguish between observations of an animal where the animal is in hand and having a tag placed on it (HumanObservation) versus the subsequent observations of that animal by a machine (MachineObservation).

@Jegelewicz
Author

Jegelewicz commented Apr 22, 2021

Another issue we have grappled with - ArctosDB/arctos#2075

or not finished grappling with....

@dshorthouse

dshorthouse commented Apr 22, 2021

@albenson-usgs If we're strict about the definition of an Occurrence then yes, eDNA is an agglomerative MaterialSample. The event portion of the Occurrences (plural) to which that initially single sample is linked is immediately knowable but the organisms (plural) that were bulk sampled may not be.

As for the practicality of where terms are placed in the DwC classes, it has to do with the operational identifiers we attach to these items and what is their cardinality within our collection management systems. If catalogNumber is a property of an Occurrence then that assumes a 1:1 relationship between it and an occurrenceID - they are operationally the same. However, if several different specimens (or their derivatives) each with a different catalogNumber are derived from a single Occurrence with its single event then we may have a problem because under some conditions, we may need to break the cardinality. In other words, GBIF wants a unique occurrenceID but I've got 10 catalogued items that were derived from a single Occurrence so I cannot make them unique and still adhere to the definition of an Occurrence unless I only publish one of them. If I buck the definition and give all of them then I have to make artificial occurrenceIDs, which may mean loss of functional collaboration across collections or across institutions if there was intent to share & reuse those occurrenceIDs. And, GBIF's value is diminished. As many of you have noticed, GBIF now has a clustering algorithm at play for occurrence records. Is it not the intent here to collapse all those disparate, artificially unique occurrenceIDs into canonical Occurrences? If it isn't, then what's the point? Why force us to make these occurrenceIDs unique? Some of us have already done that clustering!
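
To make the cardinality problem concrete, here is a small sketch of the two publishing choices for ten cataloged items derived from one Occurrence. The function names and identifier schemes are hypothetical; only `occurrenceID`, `materialSampleID` and `catalogNumber` are DwC terms.

```python
def publish_flat(occurrence_id, catalog_numbers):
    """Mint one artificial occurrenceID per cataloged item so each row is unique."""
    return [
        {"occurrenceID": f"{occurrence_id}/{cat}", "catalogNumber": cat}
        for cat in catalog_numbers
    ]


def publish_linked(occurrence_id, catalog_numbers):
    """Keep one shared occurrenceID and make materialSampleID the unique key."""
    return [
        {"materialSampleID": f"ms:{cat}", "occurrenceID": occurrence_id,
         "catalogNumber": cat}
        for cat in catalog_numbers
    ]


catalog_numbers = [f"CAT-{n:03d}" for n in range(1, 11)]
flat = publish_flat("occ:42", catalog_numbers)
linked = publish_linked("occ:42", catalog_numbers)
```

The flat form satisfies a "unique occurrenceID per row" requirement but hides the shared origin; the linked form keeps the single Occurrence visible at the cost of repeating its identifier.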

@campmlc

campmlc commented Apr 22, 2021

Agree with @dshorthouse. This is highly relevant, as my institution is in the process of setting up an environmental sample/eDNA repository in Arctos, similar to an existing repository at the University of Alaska Museum of the North (https://arctos.database.museum/SpecimenSearch.cfm?guid_prefix=UAM%3AEnv). We are considering including all derived taxonomic IDs and genetic sequences under a single catalog number, as having all been derived from the same occurrence (water sample, soil sample). Alternatively, we could catalog each unique taxonomic OTU separately, and link it back to the original source catalog item via URL relationships. The latter is entirely feasible but much more complex, especially if there are hundreds of OTUs that result from a single eDNA sample. What we really need is a way to designate the original source sample, e.g. the water or soil, with a unique source identifier similar to a dwc:organism ID.
Also, our collections have many different examples of catalog items that represent multiple occurrences. These catalog items usually include multiple material samples, e.g. multiple tubes of blood and serum collected from the same animal at different occurrence events. These situations are not hypothetical.

@deepreef

Hokay... where to begin? (Note to @timrobertson100: Now is the time to go get that cup of tea...)

So, I first climbed into this rabbit hole several years ago, when I started minting materialSampleID identifiers for our specimens. Initially, at least, these had a 1:1 correspondence with occurrenceID values, as presented through DwC. At the time, we had no resources to conduct a major overhaul of our (homegrown) specimen data management systems, but it did trigger a conceptual odyssey that I've been wandering through ever since.

DarwinCore began as a way for the Museum community to share data about preserved specimens (fun fact: the term is credited to Allen Allison, who apparently blurted it out by mistake when he meant to say "Dublin Core" at a ZBIG meeting - or so he tells me). Thus, the original implied basisOfRecord is what we now refer to as PreservedSpecimen. Soon thereafter, it was assumed that the most valuable data extraction from our specimens was in terms of representing points on a map (i.e., distributions of taxa across geography). Non-vouchered observations also represent points on a map, so the implied basisOfRecord was expanded to accommodate what we now refer to as HumanObservation (and in a few cases at the time, what we now refer to as MachineObservation). Accordingly, the core class/term in DwC was changed to Occurrence, as a more general way of representing points on a map.

Somewhere along the way, what we used to think of as "specimens" now became "occurrences", as if they were congruent concepts. But of course, specimens are physical entities with all sorts of properties important to the people who care for them (such as preparations, disposition, etc.), whereas (as @dshorthouse already noted) occurrences are ephemeral things, capturing the abstract idea of an Organism being present in the context of an Event. When MaterialSample was first proposed, it was not (as I recall) an effort to reconcile this logical incongruity. Rather, it was proposed initially to accommodate multi-taxon "gatherings" (e.g., soil, water), which at the time were the basis for the growing notion of eDNA. After some hashing and thrashing on the email discussion forums, the Class was born and now bears the definition "A physical result of a sampling (or subsampling) event. In biological collections, the material sample is typically collected, and either preserved or destructively processed.", and the examples are: "A whole organism preserved in a collection. A part of an organism isolated for some purpose. A soil sample. A marine microbial sample.". In that context, it's kinda hard not to equate MaterialSample with "Specimen".

(I trust @tucotuco or @stanblum or someone else active in early DwC activities will correct any errors in this historical synopsis...)

I've continued to stare at my ceiling late at night (more often than I should probably admit) pondering the essence and meaning of MaterialSample in the context of other DwC classes, but it's gotten a bit more "real" for me recently. We suddenly have a lot more resources to support the digitization of our collections (and, of particular interest for me, integrate collections data and research data more effectively), and so what had been an entirely intellectual exercise to occupy time late at night, in the shower, stuck in traffic, etc., has now become a very specific practical issue for me. Over the next 4 months, I will be updating the core data model behind our collections data, and one of the specific issues that our CMs need to "fix" is the way we track physical objects in our collections -- i.e., as instances of MaterialSample. Indeed, I recently reached out to @tucotuco

A lot of the discussion above focuses on the boundary between Occurrence and MaterialSample. While I agree that is relevant to the extent that many content providers present "specimen" data as instances of Occurrence, and some have therefore (mistakenly, in my view) equated the two concepts, it's also the easy one to deal with. Instances of MaterialSample very clearly represent physical things preserved in collections, whereas instances of Occurrence represent abstract facts concerning the presence of an instance of Organism at an instance of Event. You don't have to go too deep into the conceptual weeds to grasp the fundamental difference between these two concepts.

Much more challenging (for me, at least), is defining the boundary between MaterialSample and Organism. The way I conceptualize an instance of Organism (which intersect with instances of Event via instances of Occurrence, and with instances of Taxon via Identification), is as a conceptual entity (with physical manifestation) that essentially begins when a sperm meets an egg (or when a single-cell organism divides, or whatever mechanism of reproduction is relevant), passes through all manner of metamorphoses over space and time, and then "ends" at some point. One of the key questions is: what marks the end of the existence of an Organism? The two most obvious candidate answers are: death, and disintegration.

This distinction (death vs. disintegration) comes into play when trying to understand the boundary between an instance of Organism, and an instance of MaterialSample. And this is where my intellectual meanderings keep bumping into a wall. In fact, I recently exchanged a series of emails with both @tucotuco and @baskaufs, primarily to ask the question (among others): Is the TDWG community ready to wrestle with this question? and On what forum should that wrestling take place? Both questions seemed to be answered in the preceding posts on this issue (i.e., "Yes", and "Here").

This post is already too long (even by my standards), so rather than regurgitate all my thinking on this, I'll close by providing a use case, and some follow-up questions.

Use case:
A bird is flying across a field, and while traversing a road, gets hit by a car. The driver pulls over, recognizes the bird as something interesting, and contacts the local Museum. The dead bird is then brought to the Museum and given to the VZ CM, who photographs it, assigns a catalog number to it, writes out a label of the pertinent details, and sticks it in a freezer. Some time later, the bird is removed from the freezer, thawed, and prepared for long-term preservation. In keeping with standard protocol, the skin is removed and preserved following one set of protocols, some tissue samples are taken and preserved following another set of protocols, and the remaining tissues are separated from the skeleton and the bones preserved following yet another set of protocols. By traditional practice at the Museum, the same Catalog Number issued to the whole bird is applied to the three separate sets of objects (Skin, Tissue, Skeleton), and one "Specimen" record is created in the database to record all the pertinent information.

I think most would agree that the living bird flying across the field is an instance of an Organism, and that its unpleasant encounter with the car as it flew across the road constitutes an Event, and together this Organism+Event intersection represents an instance of Occurrence. I suspect that most people would also agree that the three preparations derived from that Organism instance represent three instances of MaterialSample.

That's the easy part. But here are the questions to consider:

  1. When did the first MaterialSample instance come into being? The moment the bird encountered the car and it died? The moment the whole bird arrived at the Museum? When it was assigned a catalog number? When it was placed in the freezer (i.e., "preserved")? When the three preparations were created?
  2. Related to this, was the whole bird in the freezer an instance of MaterialSample, serving as a "parent" of the three derived MaterialSample instances (Skin, Tissue, Skeleton)? (perhaps suggesting the need for a new term parentMaterialSampleID?)
  3. Did the instance of Organism cease to exist when the bird's heart stopped beating? Does it continue to exist as a physical entity after the three preparations are created (and after the remaining tissue material disintegrates)? Does it continue to exist as a conceptual entity until all of the physical matter that comprised it fully decomposes?
  4. Related to the above: what is the semantic relationship between an instance of MaterialSample and an instance of Organism? Something like "isDerivedFrom"?
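
For discussion purposes, here is one possible encoding of the bird use case. It assumes a "yes" answer to question 2, i.e. the whole frozen bird is itself a MaterialSample and the parent of the three preparations. The field `parent_material_sample_id` stands in for the parentMaterialSampleID term floated above (not an existing DwC term), and all identifiers are invented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MaterialSample:
    material_sample_id: str
    description: str
    organism_id: str  # the Organism this sample is derived from
    parent_material_sample_id: Optional[str] = None  # hypothetical term (question 2)


# The whole bird in the freezer, treated here as the first MaterialSample.
whole = MaterialSample("ms:bird-whole", "whole frozen bird", "org:bird-1")

# The three preparations, each pointing back to the whole bird.
parts = [
    MaterialSample(f"ms:bird-{name}", name, "org:bird-1", whole.material_sample_id)
    for name in ("skin", "tissue", "skeleton")
]


def children_of(parent_id, samples):
    """Return the samples whose parent is the given materialSampleID."""
    return [s for s in samples if s.parent_material_sample_id == parent_id]


derived = children_of(whole.material_sample_id, [whole] + parts)
```

A different answer to question 1 or 2 would simply drop the `whole` record and leave the three preparations without a parent, linked only through `organism_id`.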

I have my own thoughts on answers to these (and other) questions, but obviously this post is already WAY too long!

Note: several more posts came in as I was writing this, and I continue to agree 100% with the assertions of @dshorthouse.

@campmlc

campmlc commented Apr 22, 2021

@deepreef Regardless of the persistence of the Organism, the "identifier" associated with this organism absolutely has to persist as a linking parent identifier with all subsequent derived parts and preservations, material sample or otherwise, including and especially, parasites and tissues and sequences and media that are deposited in other collections and institutions and repositories, to track these back to the source organism and occurrence. This is also true for source/parent material such as soil/water etc. for eDNA, which technically is not an "organism" but which also has the same need to track parent/child relationships from a source collection object and occurrence.

@deepreef

deepreef commented Apr 22, 2021

@campmlc :

Regardless of the persistence of the Organism, the "identifier" associated with this organism absolutely has to persist as a linking parent identifier with all subsequent derived parts and preservations, material sample or otherwise, including and especially, parasites and tissues and sequences and media that are deposited in other collections and institutions and repositories, to track these back to the source organism and occurrence.

I ABSOLUTELY agree! I was focused more on the conceptual entity of the Organism instance. We need to understand what the "thing" is before we can correctly represent the semantic/cardinality relationships between a digital record (and identifier) representing an Organism, and the other digital identifiers we mint for other classes of "things".

@albenson-usgs

albenson-usgs commented Apr 22, 2021

We are considering including all derived taxonomic IDs and genetic sequences under a single catalog number, as having all been derived from the same occurrence (water sample, soil sample).

I'm not following this. An occurrence is the observation of a taxon at a place and time. What you are talking about here (to me) is an event. The occurrences are the OTUs or taxa that you detected in the event. For me I can't understand how this is one occurrence. This is an event with many occurrences.

What we really need is a way to designate the original source sample, e.g. the water or soil, with a unique source identifier similar to a dwc:organism ID.

I don't understand why you wouldn't use an eventID for this. The event being a sample of water collected at a place and time.

@albenson-usgs

If we're strict about the definition of an Occurrence then yes, eDNA is an agglomerative MaterialSample

Ok but the associatedSequences is basically the identification of the occurrences. Using @deepreef's logic above it's the intersection of Taxon via Identification that tells you there was an Occurrence which is part of an Event.

@dshorthouse

dshorthouse commented Apr 22, 2021

This is an event with many occurrences.

Aha! What I think you mean here is an event with many "things". You can't have an occurrence without an event - they are inextricably linked. It is equally incongruous to imagine an occurrence with many events unless we invoke quantum entanglement. The nut @deepreef is getting us to crack is what are these "things"? We could call them MaterialSamples and there appears to be reason for doing so especially when split & scatter protocols are employed.

@albenson-usgs

What I think you mean here is an event with many "things".

But the things are not MaterialSamples because the material sample is also the event (a sample of water, a sample of soil). The many "things" are many different sequences that tell us multiple taxa were present at that place and time (or nearby at least).

@Jegelewicz
Author

An occurrence is the observation of a taxon at a place and time.

Yep - https://dwc.tdwg.org/terms/#occurrence

This is an event with many occurrences.

I agree - the trouble is, we don't do a good job of this. Events don't have identifiers that are shared by everyone and it is VERY easy to end up with multiple interpretations of a single event.

@deepreef

deepreef commented Apr 22, 2021

@albenson-usgs :

Ok but the associatedSequences is basically the identification of the occurrences. Using @deepreef's logic above it's the intersection of Taxon via Identification that tells you there was an Occurrence which is part of an Event.

Yes -- this is another one of the conundrums. Per DwC definition of associatedSequences:

  • A list (concatenated and separated) of identifiers (publication, global unique identifier, URI) of genetic sequence information associated with the Occurrence.

This is another example of a term currently nested within the Occurrence class, that doesn't belong there. The question is: where does it belong? An argument can be made that the sequences are really associated with the Organism that held the genome from which the sequences were derived. Another argument can be made that the sequences are associated with the MaterialSample, extracted from the Organism, from which the actual sequence was created. But I don't think a case can be made that associatedSequences are properties of an Occurrence. Unlike properties like sex, lifestage and others, which change over the course of the ever-changing essence of an Organism over the course of its existence (and therefore need to be anchored to a moment in time, or captured in the form of an Event), the DNA sequences derived from an Organism are the same across its lifetime.

But that does not address your point, which brings in Taxon and Identification. The latter is the intersection between the former and an instance of Organism. As such, a DNA sequence can serve as "Evidence" (not yet a DwC class -- but perhaps there is a need for it) for a taxonomic Identification, but it is not, strictly speaking, the Taxon itself (nor the Identification itself). Going a step further, I wouldn't call the sequences themselves instances of MaterialSample; I think they're more analogous to an Image or other form of multimedia, essentially serving as some sort of "representation" of the Organism, derived directly from a MaterialSample (e.g., tissue sample, or water/soil sample).

But the things are not MaterialSamples because the material sample is also the event (a sample of water, a sample of soil). The many "things" are many different sequences that tell us multiple taxa were present at that place and time (or nearby at least).

I don't view samples of water or soil as "Events", any more than I view specimens as "Events". This diagram of Darwin-SW is very helpful, I think, in showing the semantic relationships among many of the core DwC classes. Unfortunately, it doesn't include a node for MaterialSample (the purpose of my previous epically-long post was an attempt to start figuring out exactly where MaterialSample would fit in this graph -- my current thinking is somehow embedded within "Token", aka "Evidence").

In summary, Events are independent of any Organism (or derivatives of organisms). They are essentially a moment in space-time (intersection of Location with a timestamp, plus some other properties). An Occurrence is the intersection of an Event and an Organism. The intersection of an Organism and a Taxon is an Identification. By my thinking, all the other stuff we traffic in (PreservedSpecimen, FossilSpecimen, LivingSpecimen, HumanObservation, MachineObservation, MaterialCitation, etc.) all represent forms of "Evidence" that support either the truth of an Occurrence instance, or the veracity of an Identification instance; but also are intrinsic things that exist independently of these evidentiary roles.
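
The relationships summarized above can be sketched roughly as follows. This is only an illustration, assuming hypothetical class and field names (none of them are meant as DwC terms) and a placeholder taxon name; it just shows that the same piece of "Evidence" can support both an Occurrence and an Identification while existing independently of either.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    """A moment in space-time: a location intersected with a timestamp."""
    event_id: str
    location: str
    timestamp: str


@dataclass(frozen=True)
class Organism:
    organism_id: str


@dataclass(frozen=True)
class Taxon:
    scientific_name: str


@dataclass(frozen=True)
class Evidence:
    """A specimen, observation, citation, etc. It exists on its own,
    independently of any evidentiary role it plays."""
    evidence_id: str
    kind: str  # e.g. "PreservedSpecimen", "HumanObservation"


@dataclass
class Occurrence:
    """Intersection of an Event and an Organism."""
    event: Event
    organism: Organism
    evidence: list = field(default_factory=list)


@dataclass
class Identification:
    """Intersection of an Organism and a Taxon."""
    organism: Organism
    taxon: Taxon
    evidence: list = field(default_factory=list)


# Hypothetical example data (all identifiers and names are made up).
bird = Organism("org-1")
roadside = Event("evt-1", "roadside field", "2021-06-01T09:00")
skin = Evidence("ms-1", "PreservedSpecimen")

occurrence = Occurrence(roadside, bird, [skin])
identification = Identification(bird, Taxon("Aus bus"), [skin])
```

Note that the Event carries no reference to any Organism, and the Evidence carries no reference to the Occurrence or Identification it supports.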

I would definitely conceptualize PreservedSpecimen and FossilSpecimen as examples of MaterialSample; but it's less clear to me whether LivingSpecimen is best framed as an instance of MaterialSample, or Organism, or both.

Food for thought: take my use-case of bird, and imagine that prior to flying across the field and being hit by a car, it lived in a Zoo. Was it a MaterialSample when it was in the Zoo, before it escaped, flew across the field, then got hit by the car? If so, was it the same instance of MaterialSample before it ended up in the Museum freezer? And does it matter whether it was conceived and born in the Zoo? What if it was collected in the wild?

What I'm trying to get at is the "essence" of an instance of MaterialSample -- ultimately to define it, but even before that, I'd like to know it when I see it (with apologies to Justice Potter Stewart).

@dshorthouse

The many "things" are many different sequences that tell us multiple taxa were present at that place and time (or nearby at least).

This is an interesting one & permit my adventurous thought experiment. What if your eDNA sample came from a river? And, after the data are worked-up, you get "cougar" as a hit among all the other microorganisms. What the heck? Turns out, through radio collar data, you discover that there's another Occurrence record captured that clearly shows a cougar on the river bank some time prior to you scooping your water sample. She cut her gums on a fish she was eating. You could argue that, through calculating the speed of water & rolling back the clock, the Occurrence records represented by your eDNA and that of the radio collar data are precisely the same. They are merely lines of evidence, derived from precisely the same event and precisely the same animal. It's just that the motions of the water added a bit of noise to your appreciation of time. Now what? Surely we need something to differentiate what we have. One record was derived from a radio collar & one was derived from eDNA but, crucially, we do REALLY want to make the joins between these things because there's a story. Is there one Occurrence here or two?

@deepreef

With all this agreement it might be a good time to start thinking about the scope of a Task Group. ;-)

Indeed! And I guess that would start with someone stepping up to lead it (...he writes, as he quickly crouches down behind the desk and starts crawling towards the back exit... ;-) )

Seriously, though -- I would be delighted to get up at 3am (or whatever time it is in Hawaii) every single week to actively participate in such a Task Group (yes, seriously!), but I am absolutely not the right person to lead it (unless the intention is to guarantee that it languishes).

@tucotuco
Member

tucotuco commented Jun 10, 2021 via email

@Jegelewicz
Author

Sigh, says the person who started this whole thing.....

I have never led a TDWG task group, so first I have to figure out the rules/process....

@smrgeoinfo

A MaterialSample is a physical object, intended to be representative of some physical thing in the world of interest to someone, on which observations can be made. In this view, a physical object is a thing composed of matter that has some defined boundaries.

@dr-shorthair

Arriving very late in this discussion ...

I wonder if it would be helpful to be more clear about the use of, and distinction between, the terms sample and specimen.

  • 'sample' relates to the relationship of the thing to hand with a broader context. A 'sample' is usually a small or manageable thing which stands for (is designed to be representative of) a larger or less accessible thing. A sample is collected for the purpose of making observations. Samples are often ephemeral.
  • a 'specimen' (in the GLAM context at least) is a preserved and curated thing. It has an ongoing utility, often but not necessarily to act as a sample of something larger.

Some material samples are preserved and curated, so they are also specimens. Some samples are not.
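
One way to picture this distinction is as two independent facets of the same object rather than two exclusive categories. The sketch below assumes hypothetical field names (`is_sample_of`, `curated`) and made-up example objects; it is not a proposal for actual terms.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MaterialThing:
    label: str
    # The larger or less accessible thing this object stands for, if any.
    is_sample_of: Optional[str] = None
    # Whether the object is preserved and curated in a collection.
    curated: bool = False


def roles(thing: MaterialThing) -> list:
    """Return which of the two roles apply; both, one, or neither may."""
    found = []
    if thing.is_sample_of is not None:
        found.append("sample")
    if thing.curated:
        found.append("specimen")
    return found


# Hypothetical examples.
water = MaterialThing("water scoop, discarded after analysis",
                      is_sample_of="river reach")
sheet = MaterialThing("plant on an herbarium sheet",
                      is_sample_of="plant population", curated=True)
```

Here `roles(water)` gives only "sample", while `roles(sheet)` gives both "sample" and "specimen".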

@smrgeoinfo has alluded to these issues a couple of times above.

@smrgeoinfo

smrgeoinfo commented Jun 17, 2021

@dr-shorthair -- that distinction between sample and specimen is new to me, but I think it works (sorry... what is GLAM context?). If I understand, the idea is there can be samples that are not specimens, and specimens that are not samples. Some specific examples would help, particularly for a 'specimen' whose utility is not dependent on its relationship to some context (feature of interest) in the world.

@dr-shorthair

GLAM - Galleries, Libraries, Archives and Museums.

This clarification of 'specimen' was given to me by Dimitris Koureas.

@deepreef

In my mind, many of these words do not have clear definitions, and so I lump them (including "specimen" and "sample") as 1:1 synonyms with MaterialSample. I think there may be value in establishing specific subclasses of MaterialSample, but there are various ways they can be framed. The distinction between 'specimen' and 'sample' described by @dr-shorthair would parse instances of MaterialSample according to their completeness as a "thing", such that something like 'whole specimen' would be branded as a 'specimen', and something like 'tissue sample' (a small sampling from a whole specimen) would be branded as a 'sample'. I think that's a step in the right direction, but there are other considerations as well.

First of all, we'd need to add a third subclass along the lines of 'aggregate', to represent those instances of MaterialSample consisting of multiple 'things' (e.g., a lot of multiple whole specimens). However, other instances of MaterialSample might represent aggregations of 'samples' (e.g., eDNA samples). Still others might represent a mixture of 'things' and 'parts of things' (e.g., a water sample that might contain whole-organism plankton and larvae, as well as fragments of tissue and DNA).

Second, there are other axes around which instances of MaterialSample might be parsed (e.g., dwc:preparations). It's not clear to me whether these might represent another set of subclasses for MaterialSample, or would be better represented in some other way using different properties to represent the different characteristics along these axes of each MaterialSample instance.

Third, the word 'specimen' has different meanings that run counter to the distinction of 'specimen' vs. 'sample'. For example, most botanical specimens are called "specimens", but they fall squarely into the definition of 'sample' as stated above. Or... maybe I misunderstand the scope of what things are larger things, vs. extracted bits of larger things?

@dr-shorthair

I suppose pretty much every 'specimen' in a museum or gallery is representative of an artist's body-of-work, or a school-of-art, or a civilization, or similar, and is likely to be catalogued in this way.

In SSN/SOSA the predicate is sosa:isSampleOf and this is pretty much the only required property of a sosa:Sample. Of course in RDF under the OWA a property instance may be missing from a specific dataset or graph, even though the Ontology says there must be one. That might be because the thing that it is a sample of is not yet clear.

I'm not at all saying that you should abandon your current terminology - I know that sample/material-sample/specimen are often used synonymously, and I used the word 'Specimen' as a synonym for 'material sample' in ISO 19156 (O&M v2) (I would change that if I had a do-over). I'm just suggesting that it is worth recognising that sampling and curation are separate concerns, and that thinking about these concerns separately might help with some of the definitional challenges.

@deepreef

I'm just suggesting that it is worth recognising that sampling and curation are separate concerns, and that thinking about these concerns separately might help with some of the definitional challenges.

I wholeheartedly agree with this, and I think this is in many ways the crux of the discussion that needs to follow with respect to the Task Group. I had written a summary of my take on the key issues related to the Task Group scope and agenda (#358), but I apparently either posted it in the wrong place, or perhaps failed to post it entirely.

@smrgeoinfo

smrgeoinfo commented Jun 18, 2021

From a usage point of view, the questions might be:
A. What is the thing that is getting bound to the identifier (I hope the intention of 'MaterialSample')? Might be:

  1. An original object as it was collected in the field ('sample') (object + collection event + sampledFeature) (see SpecimenType for proposed iSamples Categories);
  2. A physical object that has been curated (preserved, mounted, given an identifier...) and accessioned to a repository for long term preservation ('specimen'); (object + 'preparation'(curation, preservation) event(s), with or without collection event + sampledFeature);
  3. A data object that is a representation of a physical object that might be 1 or 2 from above. (what MIDS is working on).

Is there something else (besides sampling procedures, responsible parties, preparation procedures, locations) that we need to identify in this domain?

There is a different (but overlapping) information model for each of these 'resources' (kinds of things).
B. What are the required information elements to document these resources (#1, 2, 3 above)?

@elywallis

Having (eventually) read (most) of the contributions to this discussion, and agreeing with many of them, I'd just like to come back to users and what will result at the front end of aggregators where these data wind up.
Starting with @deepreef 's reminder that:

DarwinCore began as a way for the Museum community to share data about preserved specimens

I'd like to bring us back to some real questions we get from users of aggregators or staff at our institutions
curator A: "How do I just see the specimens?"
curator B: "I don't want any observation records, and no environmentalDNA records either"
curator C: "Dr X (researcher) rang me up and wants to know what tissue samples we have for [insert taxon name here] so that they can request some for DNA analysis. How does Dr X do that?"

Discussions like these (whilst interesting) can easily drift into such complexity that we lose sight of the fairly straightforward ways in which users may want to interact with the data. The "I just want to ... " usecase.

Regardless of how this discussion ends up, can I put in a plea for the least-skilled user here? Or even for the user who is, perhaps, an ecologist or a population modeller or a government bureaucrat who "just wants some data". They might not know to look for "material sample" when they just want to get data about some specimens in a particular collection; or might not be too concerned about whether there's a boundary between an organism and a taxon because they just want to find that feather, and they know the record is there somewhere. Or for the collection manager who isn't going to sit and read 128 comments to find out how to map the things they call "specimens" out of their collection management system and into DarwinCore?

We might not be there yet, but could we please eventually circle back to how will real-life users navigate these concepts in GBIF, or ALA, or wherever, so that users can intuitively find what they're looking for?

@deepreef

Very well said, @elywallis !!! I completely agree. Indeed, a large part of my interest in this topic (MaterialSample) is to address exactly the sorts of real questions you pose. I think the answer to all three questions (and many others) rests with some sort of logical alignment of properties like preparations, disposition (both of which I have argued should be organized as properties of the MaterialSample class), materialSampleType, and parentMaterialSampleID (among others).
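
As a rough sketch of how such properties could answer the three curator questions, assume flat records keyed by the term names mentioned in this thread (`basisOfRecord`, `preparations`, `disposition`, `materialSampleType`). The records, the values, and the helper function names below are all made up for illustration, and the value "environmentalDNA" for `materialSampleType` is an assumption, not a controlled vocabulary entry.

```python
# Hypothetical records; identifiers, names and values are placeholders.
records = [
    {"id": "r1", "basisOfRecord": "PreservedSpecimen",
     "scientificName": "Aus bus", "preparations": "skin",
     "disposition": "in collection"},
    {"id": "r2", "basisOfRecord": "HumanObservation",
     "scientificName": "Aus bus"},
    {"id": "r3", "basisOfRecord": "MaterialSample",
     "scientificName": "Aus bus", "preparations": "tissue",
     "disposition": "in collection", "materialSampleType": "tissue"},
    {"id": "r4", "basisOfRecord": "MaterialSample",
     "scientificName": "Cus dus", "materialSampleType": "environmentalDNA"},
]

SPECIMEN_BASES = {"PreservedSpecimen", "FossilSpecimen", "LivingSpecimen"}
OBSERVATION_BASES = {"HumanObservation", "MachineObservation"}


def just_specimens(recs):
    """Curator A: 'How do I just see the specimens?'"""
    return [r for r in recs if r["basisOfRecord"] in SPECIMEN_BASES]


def no_observations_or_edna(recs):
    """Curator B: no observation records and no environmental DNA records."""
    return [
        r for r in recs
        if r["basisOfRecord"] not in OBSERVATION_BASES
        and r.get("materialSampleType") != "environmentalDNA"
    ]


def tissue_samples_for(recs, scientific_name):
    """Curator C: which tissue samples exist for a given taxon name?"""
    return [
        r for r in recs
        if r.get("scientificName") == scientific_name
        and "tissue" in r.get("preparations", "")
    ]
```

The point is only that each "I just want to ..." question becomes a simple filter once the relevant properties are populated consistently; which terms and vocabularies should carry that information is exactly what remains to be decided.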

While I have probably been more guilty than anyone else in terms of endlessly waxing philosophical/conceptual, my real motivation here is driven by very practical needs. We at Bishop Museum are in the early stages of an informatics renaissance, including the harmonization of digitization and data management among several major natural history collections as well as cultural and library/archives collections.

A great deal of what I hope the Task Group will accomplish is to help me get to a place where our data management system can easily answer exactly these kinds of real-world questions.

One final note in defense of the philosophical/conceptual perspective, though: I have learned over the years that constructing solutions optimized for solving real-world problems that are right in front of us sometimes (often?) accomplishes short-term gain in exchange for long-term pain. Having endured a great deal of that pain, I can see how thinking this stuff through carefully can lead to the development of systems that not only easily answer the questions that are right in front of us, but also the many countless questions we haven't even thought to ask (...yet). [That is...short-term pain in exchange for long-term gain...]

@Jegelewicz
Author

I woke up in the middle of the night thinking about how DarwinCore is NOT structured for collection management or sample discovery. The two comments above make me feel that even more.

@dshorthouse

dshorthouse commented Jun 25, 2021

I woke up in the middle of the night thinking about how DarwinCore is NOT structured for collection management or sample discovery. The two comments above make me feel that even more.

Very true! And the irony is, when creating a collections management system from scratch, one of the (many) considerations designers try to accommodate is, "How does this translate to Darwin Core?" Rolling-up relational concepts into a philosophical Occurrence is not always an easy thing to do, especially when there are layers of derivative objects not all of which participated directly in a collecting event. The other tension is the very common situation where derivative objects have determinations that differ from that of the parent (eg barcode BIN applied to the DNA derived from a severed leg vs identification by a person applied to the once whole beetle using morphological characters, both of which may be independently shared by different organizations as components of what looks like two Occurrence records).

@tucotuco
Member

Darwin Core is not structured. When we give courses on Darwin Core we try to hammer this in. It is a bag of terms that we hope to define well enough so that it can be reused in lots of contexts (including to define fields in databases). The confusion arises because the most popular (but not only) way to share data "in Darwin Core" is through Darwin Core Archives and the very few structures supported through "cores" and "extensions" defined by XML files on rs.gbif.org.

@Jegelewicz
Author

When we give courses on Darwin Core

I think I need one of these....

@tucotuco
Member

I think I need one of these....

I would begin at Chapter 0 on https://github.com/tdwg/dwc-qa/wiki/Webinars.

@deepreef

With the caveat from @tucotuco that DwC is not intended to be structured (not entirely true, or it wouldn't have defined classes)...

Rolling-up relational concepts into a philosophical Occurrence is not always an easy thing to do, especially when there are layers of derivative objects not all of which participated directly in a collecting event.

So... even though DwC was born in the context of sharing specimen data, I think the community at the time felt that the most valuable output from sharing/aggregating specimen data was the ability to document the occurrence of organisms in space and time (specifically, via collecting event data associated with those specimens). In that context, the large body of unvouchered observational records (especially birds) represented an opportunity to add additional "meat" to these patterns of organisms occurring in space and time. Thus, the notion of a "specimen" as the basis of record became Occurrence.

I think that was a step in the right direction. However, the part that never sat well with me was the notion of Specimen=Occurrence, which seemed to pervade TDWG-land for a number of years, and underpins the point raised by @dshorthouse quoted above.

Later, with the near-simultaneous introduction of MaterialSample and Organism classes, I think DwC took another step in the right direction towards dis-entangling assertions about the presence of organisms in space and time from the "evidence" used to support those assertions. I've rambled on enough about that earlier in this thread, so no need to repeat here.

Unfortunately (but predictably), there was a period of several years when most TDWG-folk weren't quite sure how best to implement the concepts represented by these two new DwC classes/concepts (especially given that one of them -- MaterialSample -- was actually intended for a slightly different function than what it ended up being defined as).

This very unusual year (for a few reasons) in DwC is, I think, in part a result of some sort of "awakening" among a critical mass of data content providers in how the "power" of these new classes (MaterialSample and Organism) can be leveraged through an evolution in how data exchange mechanisms built on DwC can function. Basically, this comes back to that goal of disentangling Occurrence instances from the evidence that supports them.

The other tension is the very common situation where derivative objects have determinations that differ from that of the parent (eg barcode BIN applied to the DNA derived from a severed leg vs identification by a person applied to the once whole beetle using morphological characters, both of which may be independently shared by different organizations as components of what looks like two Occurrence records).

Indeed, Identification does play a critical role in this -- because historically TDWGers have used Taxon as a proxy for Organism; when in fact, Taxon is a non-objective property of an Organism. My contention is that Organisms participate in Occurrences, and these Organisms are asserted to represent members of a particular Taxon through one or more Identification instances. And, as I've expounded previously in this thread, "evidence" underpins both instances of Occurrence and instances of Identification.

So, in the example from @dshorthouse above, a single Organism might have more than one derived MaterialSamples -- e.g., the "parent" whole specimen (linked to a collecting event, and thus serving as evidence of an Occurrence), and the "child" tissue sample. The "parent" MaterialSample might serve as evidence to apply an Identification of one Taxon to the Organism from which the whole specimen was derived, and the "child" tissue sample might result in a DNA sequence that serves as evidence in support of an Identification to a different Taxon. If the link between these two MaterialSamples (either the parent/child relationship, or the fact that both are derived from the same Organism) is not clearly maintained, then as @dshorthouse notes these will likely end up as two distinct Occurrences, rather than (correctly) a single Occurrence based on an Organism with more than one asserted Taxon Identification....
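
To illustrate the point about maintaining the link, here is a minimal sketch that rolls MaterialSample records up by `organismID`, so that a parent specimen and its child tissue sample yield one Occurrence candidate carrying two asserted taxa instead of two separate Occurrences. The records, the `identifiedAs` field, the function name, and the taxon names are hypothetical; `parentMaterialSampleID` is used as named earlier in this thread.

```python
from collections import defaultdict

# Hypothetical records: a whole specimen and a tissue sample taken from it.
material_samples = [
    {"materialSampleID": "ms-1", "parentMaterialSampleID": None,
     "organismID": "org-1", "eventID": "evt-1", "identifiedAs": "Aus bus"},
    {"materialSampleID": "ms-2", "parentMaterialSampleID": "ms-1",
     "organismID": "org-1", "eventID": None, "identifiedAs": "Aus cus"},
]


def occurrences_by_organism(samples):
    """Group samples by organism, collecting events, taxa and evidence."""
    grouped = defaultdict(
        lambda: {"eventIDs": set(), "taxa": set(), "evidence": []}
    )
    for sample in samples:
        entry = grouped[sample["organismID"]]
        # Only the sample tied to a collecting event contributes an event.
        if sample["eventID"]:
            entry["eventIDs"].add(sample["eventID"])
        entry["taxa"].add(sample["identifiedAs"])
        entry["evidence"].append(sample["materialSampleID"])
    return dict(grouped)


result = occurrences_by_organism(material_samples)
```

With the link intact, `result` has a single entry for `org-1` with one event, two taxa, and both samples listed as evidence. If `organismID` were missing from the child record, the same data would split into two apparent Occurrences.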

phew... OK, I obviously allowed my philosophical/conceptual waxer out of his cage for longer than I probably should have, but there you go.

@deepreef

deepreef commented Jun 25, 2021

The previous post was way too long, so I decided to break this out as a follow-up.

I wanted to call out a different "tension" that I think underpins a lot of the conflict/confusion on this issue: The tension between the needs of collection object managers, and the needs of biodiversity researchers. Obviously, many of us are primarily focused on managing physical objects in a collection (MaterialSample-centric perspectives), and many of us are primarily focused on the biodiversity research aspects of DwC data (Occurrence/Identification-centric perspectives). A few of us have one leg firmly planted in each camp. Two main obstacles must be overcome to achieve success for both camps using the same data exchange standard:

  1. We need to understand how the data should be modelled in ways that support both sets of needs; and
  2. We need information management software that leverages such a robustly functional data model.

My sense is that these and other recent discussions are getting us close to (1), which is what all the philosophical/conceptual mumbo jumbo is about. After that reaches some sort of stable(ish) asymptote, then the real challenge of evolving our information management software (2) begins. My sincere hope is that we can make real progress on (1) without breaking existing software and protocols -- but I've often been accused of being overly optimistic.

@Jegelewicz
Author

@deepreef this is exactly what I have been thinking - see ArctosDB/arctos#3630 (comment)

@afuchs1

afuchs1 commented Jun 28, 2021

@deepreef

I wanted to call out a different "tension" that I think underpins a lot of the conflict/confusion on this issue: The tension between the needs of collection object managers, and the needs of biodiversity researchers. Obviously, many of us are primarily focused on managing physical objects in a collection (MaterialSample-centric perspectives), and many of us are primarily focused on the biodiversity research aspects of DwC data (Occurrence/Identification-centric perspectives). A few of us have one leg firmly planted in each camp.

A great summary of the issue :)

@tucotuco
Member

This issue has been superseded by #451
