Coordinates, verbatim coordinates, and data entry #752

Closed
dustymc opened this issue Sep 14, 2015 · 11 comments

Comments

@dustymc
Contributor

dustymc commented Sep 14, 2015

From UAM:Ento:

When we get a specimen that has no original lat/longs we georeference the locality, get the lat/longs and enter them.

But in order to do so we have to select 'decimal degrees' from the original lat/long units field, which is false - the original units were not present at all.

I've long wished that Arctos had some way to indicate if the lat/longs came from the label (and presumably the collector) or were later interpretations (and if later, then who made them?)

The data entry screen (and bulkloader) treat locality coordinates (locality.dec_lat, locality.dec_long) and verbatim coordinates (collecting_event.verbatim_coordinates) as the same thing to simplify usability. That's possible to change, but would require the addition of "duplicate" coordinate fields and several (around 30) extra columns to the bulkloader, which I suspect would come with significant usability issues. I am completely open to better ideas.

Arctos now provides for any number of "locality stacks" (everything between specimens and higher geography), so one way to deal with this is by entering two localities:

  • there (according to collector) [no coordinates]
  • there (according to data entry person, or whatever) [some sort of coordinates]

which is of course twice as much work with what seems to me an insignificant benefit - if you mis-typed (or downloaded from your GPS) the "verbatim" you probably did the same for the "spatial."

We do a good job of keeping track of how they were made (georef protocol & source) but not who made them (or when).

This was present in the old model and was by design excluded from the new. Under the current model (and any GIS system or map), [X, Y +/- Z, incl. datum etc.] is a defined geospatial area; it's a fact, a data object - I can find it on a map or go there or compare it to other areas. The assertion (via specimen_event.assigned_by_agent, specimen_event.assigned_on_date) is "this specimen<--->{specimen_event_type}<--->that place."

Pull 20,000 bugs out of a trap (at the same time, across centuries, whatever, the PLACE is all the same), enter them wrong, you can fix them all by updating one thing here.

Under the old model and what you're proposing, {[X, Y +/- Z, incl. datum etc.] + who/when} is an assertion, or at least a likely duplicate. Nail a GPS to the ground, we both read the same numbers off of it, we have two "places." You read it, and then do so again one second (the precision of Oracle's DATE type) later with the same place-results, we have two places. I guess I'm not opposed to that model, but I am very opposed to that model under our current data structure. If any two specimens are exceedingly unlikely to share a place, then why do we need place as a data object at all? If we have a time component to places, why do we have another time component one join away? Why not move it all closer to specimens (something like Attributes) and simplify the model?

Pull 20,000 bugs out of a trap, enter them wrong, you'll need to update 20,000 things here.
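
The "one thing" versus "20,000 things" contrast can be sketched with an in-memory SQLite database. This is only an illustration under simplifying assumptions: specimen_event points straight at locality (the real model has more tables in between), the specimen_place table and its columns are hypothetical stand-ins for a per-assertion model, and the coordinates and agent name are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Shared-place model: the coordinates are stored once, in locality.
    CREATE TABLE locality (
        locality_id INTEGER PRIMARY KEY,
        dec_lat REAL,
        dec_long REAL
    );
    CREATE TABLE specimen_event (
        specimen_event_id INTEGER PRIMARY KEY,
        locality_id INTEGER REFERENCES locality(locality_id),
        assigned_by_agent TEXT,
        assigned_on_date TEXT
    );
    -- Hypothetical per-assertion model: each row carries its own coordinates.
    CREATE TABLE specimen_place (
        specimen_place_id INTEGER PRIMARY KEY,
        dec_lat REAL,
        dec_long REAL,
        determined_by TEXT,
        determined_date TEXT
    );
""")

N = 20_000  # bugs pulled out of one trap
who_when = [("data entry", "2015-09-14")] * N

# The latitude is mistyped (46.8 instead of 64.8) in both models.
conn.execute("INSERT INTO locality VALUES (1, 46.8, -147.7)")
conn.executemany(
    "INSERT INTO specimen_event (locality_id, assigned_by_agent, assigned_on_date) "
    "VALUES (1, ?, ?)",
    who_when,
)
conn.executemany(
    "INSERT INTO specimen_place (dec_lat, dec_long, determined_by, determined_date) "
    "VALUES (46.8, -147.7, ?, ?)",
    who_when,
)

# Fixing the typo touches one row in the shared model, N rows in the other.
shared_fix = conn.execute(
    "UPDATE locality SET dec_lat = 64.8 WHERE locality_id = 1"
).rowcount
copied_fix = conn.execute(
    "UPDATE specimen_place SET dec_lat = 64.8 WHERE dec_lat = 46.8"
).rowcount
print(shared_fix, copied_fix)
```

The row counts are the whole point: the correction is a single update when the place is a shared object, and one update per specimen when who/when is folded into the place.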

The introduction of "verbatim coordinates" to collecting event is an attempt to have both geospatial-capable locality objects and whatever someone scribbled on some label, verbatim. I think anything preventing those two actions is "only" an interface problem. Adding more metadata to the locality node is a modeling problem, and one which potentially completely changes the nature of the data.

@dustymc dustymc added this to the Needs Discussion milestone Sep 14, 2015
@DerekSikes

What is missing from the current model is:
"this place name" -> "these coordinates"
The assertion is that such and such a locality name (eg river valley ca. 35km sw Mordor) has "these coordinates." There is a huge area of potential ambiguity and skill involved in making such an assertion. The current model has no means to document who did this or when (but it does capture how it was done, eg GPS, Google Maps, etc.).

@dustymc
Contributor Author

dustymc commented Sep 15, 2015

My take (and see http://arctosdb.org/documentation/places/specimen-event/) is that it's the job of the agent listed in specimen_event to determine if "these coordinates" and "river valley ca. 35km sw Mordor" have anything useful to do with each other in the context of a specimen, and simply not use them if they don't.

From a user's perspective, I don't see much difference between "came up with coordinates" and "hooked specimen to coordinates of indeterminate origin." I'm probably missing some curatorial use...

The old model structure had place name as primary data, with coordinates determined (by a person, etc.) from it.

place_name
---->coordinates (accepted_or_not_flag)
---->coordinates (accepted_or_not_flag)
---->...

which is incompatible with....

coordinates (downloaded from my WAAS-enabled FAA-certified GPS)
---->vague and sometimes wrong place name, filled in by someone for some strange reason

Coordinates and placenames are complementary in the current model - they're in the same table and functionally equivalent, part of the same THING.

GEOREFERENCE_PROTOCOL should provide coordinates-from-description vs. description-from-coordinates directionality.

"GPS download" and "GPS transcription" (the best and debatably-second-best source of coordinates) seem to be buried in GEOREFERENCE_SOURCE (with 7K other values) - and some of mine (which were downloaded) are entered as just "GPS" and so aren't distinguishable from the (normal) "transcribed from the transcription in the field notes" data (which have a large error rate). I have no idea what we're TRYING to do with these fields, but I don't think we're doing it.

GEOREFERENCE_SOURCE is obviously acting as a huge denormalizer (what I've been trying to avoid with the addition of a who/when). The data are mostly variations of a very few things (collector did it, found it on a map, MaNIS, GeoLocate). I completely fail to understand how "2007, Google Earth Maps, Europa technologies, Eye alt=11528 ft" is going to allow me to end up with the same coordinates (and that WAS the point; this field was invented for MaNIS), and if it can't do that I'm not sure what it is useful for.

select count(*) from locality;

COUNT(*)

713299

UAM@ARCTOS> select count(*) from locality where dec_lat is not null;

COUNT(*)

575983

UAM@ARCTOS> select count(distinct(dec_lat || dec_long)) from locality;

COUNT(DISTINCT(DEC_LAT||DEC_LONG))

            181937
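
A rough reading of the three counts above, as a small Python sketch. The arithmetic uses only the numbers reported by the queries; the ratio is approximate because the distinct count runs over the whole table rather than only rows with dec_lat. The second half illustrates a general caveat of counting distinct concatenated strings: without a separator, different pairs can produce the same key (the sample values are made up, and whether this affects the real data is not known).

```python
# Counts copied from the query output above.
total_localities = 713_299
with_coordinates = 575_983
distinct_pairs = 181_937

without_coordinates = total_localities - with_coordinates
rows_per_pair = with_coordinates / distinct_pairs  # approximate average

print(f"localities without dec_lat: {without_coordinates}")
print(f"locality rows per distinct coordinate pair: {rows_per_pair:.2f}")


def concat_key(lat: str, lon: str, sep: str = "") -> str:
    """Build the string that a distinct(lat || long) style count would compare."""
    return f"{lat}{sep}{lon}"


# Two different (made-up) coordinate pairs that concatenate to the same string...
collides = concat_key("1.2", "34") == concat_key("1.23", "4")
# ...and stay distinct once a separator is placed between the two values.
still_collides = concat_key("1.2", "34", "|") == concat_key("1.23", "4", "|")
print(collides, still_collides)
```

On these numbers, each distinct coordinate pair is shared by a little more than three locality rows on average.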

Our "compromise" model is the "let's denormalize JUST until nobody's happy" model. Let's fix it. I see two ways out:

  1. I'll just add whatever anyone wants to Locality, but I get to drop all pretenses of "duplicate." We accept that these data are denormalized by the addition of metadata and stop pretending that locality_id is anything other than a primary key in a table. (Lots of folks try to use locality_id as a "site identifier.") "Locality Nickname" (actual site identifier) remains, so the people who DO want unique-at-some-scale localities can get them through it. (I'd suggest we merge collecting event and locality - what's the point of a "verbatim" table if everything in the next table upstream is verbatim too? - but I think that would break paleo's multi-year named locality data.)

  2. We normalize. I have no idea what that means from here. Something about a unique index on coordinates/error/depth/elevation with metadata elsewhere or not existing (I still have no idea what we're trying to DO with whodunit or GEOREFERENCE_SOURCE) or something.
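
One possible reading of option 2, sketched with SQLite. This is an assumption-heavy illustration, not a proposed schema: the place and place_determination tables, their columns, and the agent names are hypothetical, and depth/elevation are left out for brevity. The unique constraint keeps one row per geospatial shape, while who/when lives in a separate table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One row per distinct shape; the UNIQUE constraint enforces "no duplicates".
    CREATE TABLE place (
        place_id INTEGER PRIMARY KEY,
        dec_lat REAL NOT NULL,
        dec_long REAL NOT NULL,
        max_error_m REAL NOT NULL,
        datum TEXT NOT NULL,
        UNIQUE (dec_lat, dec_long, max_error_m, datum)
    );
    -- Who/when metadata is kept apart from the shape itself.
    CREATE TABLE place_determination (
        place_id INTEGER NOT NULL REFERENCES place(place_id),
        determined_by TEXT NOT NULL,
        determined_date TEXT NOT NULL
    );
""")


def record_reading(dec_lat, dec_long, max_error_m, datum, agent, date):
    """Reuse the matching place if it exists, then log who reported it and when."""
    shape = (dec_lat, dec_long, max_error_m, datum)
    conn.execute(
        "INSERT OR IGNORE INTO place (dec_lat, dec_long, max_error_m, datum) "
        "VALUES (?, ?, ?, ?)",
        shape,
    )
    place_id = conn.execute(
        "SELECT place_id FROM place "
        "WHERE dec_lat = ? AND dec_long = ? AND max_error_m = ? AND datum = ?",
        shape,
    ).fetchone()[0]
    conn.execute(
        "INSERT INTO place_determination VALUES (?, ?, ?)",
        (place_id, agent, date),
    )
    return place_id


# Two agents read the same numbers off the same GPS.
first = record_reading(64.8, -147.7, 10.0, "WGS84", "agent A", "2015-09-15")
second = record_reading(64.8, -147.7, 10.0, "WGS84", "agent B", "2015-09-15")

place_count = conn.execute("SELECT COUNT(*) FROM place").fetchone()[0]
determination_count = conn.execute(
    "SELECT COUNT(*) FROM place_determination"
).fetchone()[0]
print(first == second, place_count, determination_count)
```

Under this arrangement the nailed-down GPS yields one place and two determinations, rather than two places.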

We need to consider usability in terms of the bulkloader and data entry screens before we do anything. Can we drop specimen_event_{agent/date} if we have looks-the-same-from-here data in locality, or are those data different (include cultural collections if we have this conversation)? Can we streamline anything else? Does this solve the "verbatim coordinates" issue (by making everything verbatim)? What do users have difficulty with now? (I think just the # of "locality fields" is a major issue.) Etc. Let's not make this MORE unusable.

If we're going to fix localities, we should also discuss the relationship between placename/coordinates and higher geography. Collected from one "locality," (esp. eg, coastal AK) a fish (seal/etc.) is likely to end up in some sea, a moose in some GMU, a lemming in a quad, a plant in some state park, etc., etc., etc., and that's actively preventing discovery by anyone trying to use higher geography.

See also #739.

@DerekSikes

Sorry I don't have time to fully digest all you wrote but I don't think you've addressed the issue. Take any record like this: http://arctos.database.museum/guid/UAM:Ento:118788
A user visits and sees that record. They might wonder if the lat/longs were estimated & confirmed by the collector, or later georeferenced by someone else (and if so, by whom?). There is some evidence visible - the specimen was collected in 1981 but the coordinates were obtained by use of Google Earth, which clearly wasn't present in 1981. Thus we know it wasn't the collector who assigned them, but who did? There is the line "accepted place of collection assigned by Derek S. Sikes on 2010-10-13" but isn't that rather ambiguous? What did I do when I assigned this specimen to that place of collection? Did I look at a label and transcribe it perfectly (no added coordinates), or did I do some interpretation and add coordinates, or make other changes? And what if I visit the locality record after someone else has assigned a specimen to a locality, see an error (like an Alaskan place mapping in China), and fix it? I've now changed the coordinates but there is no record that it was me who did so.

@dustymc
Contributor Author

dustymc commented Sep 15, 2015

My argument remains this: The agent listed in specimen_event.assigned_by is responsible for everything in the locality stack. The model can be interpreted no other way. (But read the last paragraph before you sharpen your pitchfork!)

Collector provided descriptive data? Collector should be specimen_event.assigned_by.

Collector somehow came up with coordinates? Collector should be specimen_event.assigned_by.

Student somehow came up with better coordinates? Student should be specimen_event.assigned_by.

Curator tightened up the error? Curator should be specimen_event.assigned_by.

#739 addresses "confirmed by...." (and I think it's a solid long-term solution, whatever we do elsewhere).

Yes, lacking something in verificationstatus, "assigned by" is ambiguous. You tossed a dart at your map for all I know - and the same is true for most things - which is why #739 proposes....

unverified
Definition: No assertion regarding the accuracy of the place and time information is made.
Migration Path: No changes.

> visit the locality record after someone else has assigned a specimen

If you have access to someone else's collection, they trust you to edit their locality. If you don't have access to their collection, you'll have to split the locality and edit that. (#740 may change that.)

> no record that it was me

A specimen can have any number of specimen-events, so just leave the old and add a new if you wish to maintain that history.

The model is pretty rigorous, things like normalization aside - it's hard to find a situation that doesn't work (if you buy into my definitions). BUT, I'm increasingly unsure that it's realistic for anyone to use the thing in a way that actually makes all that work. Doing so would require a lot of specimens having a lot of localities (eg, 4 in the above example), things that can be done with a click or two (update coordinates of unverified localities) should be done with lotsa-clicks (add/edit a new locality), etc. I don't think I can write interfaces to simplify that without introducing some sort of unexpected complications elsewhere. Given those two things, I suggest we back up and re-analyze what sort of locality (in the broadest sense) data we want and what we expect to do with it, then design a model which does that. If that's not possible (and it probably isn't, short-term), I propose we drop some expectations (eg, localities being somewhat-unique) and cram whatever we need to answer whatever questions y'all want answered into the current model.

@DerekSikes

I like this: "Collector provided descriptive data? Collector should be specimen_event.assigned_by.

Collector somehow came up with coordinates? Collector should be specimen_event.assigned_by.

Student somehow came up with better coordinates? Student should be specimen_event.assigned_by.

Curator tightened up the error? Curator should be specimen_event.assigned_by."

and it's what we try to do mostly... there are problems with usability though.

If the curator edits the locality record why not have Arctos auto-magically change 'specimen_event.assigned_by' to that agent's name & the new date? Doing this manually when editing lots of locality records just doesn't happen.

@dustymc
Contributor Author

dustymc commented Sep 15, 2015

> have Arctos auto-magically change 'specimen_event.assigned_by' to that agent's name & the new date?

Magical probably needs to go through the group (might not be a bad default behavior) but...

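[image: screen shot 2015-09-15 at 4 26 40 pm]
https://cloud.githubusercontent.com/assets/5720791/9892751/b68c4276-5bc6-11e5-9591-f032967dd517.png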

can go out with the next Arctos release.
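
What "push the editor's name and date to specimen_event when a locality is edited" could look like, as a hedged sketch. This is not the Arctos implementation, which the thread does not show: the edit_locality function, the push_to_events flag, the simplified two-table schema, and all sample values are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE locality (
        locality_id INTEGER PRIMARY KEY,
        dec_lat REAL,
        dec_long REAL
    );
    CREATE TABLE specimen_event (
        specimen_event_id INTEGER PRIMARY KEY,
        locality_id INTEGER REFERENCES locality(locality_id),
        assigned_by_agent TEXT,
        assigned_on_date TEXT
    );
    -- An Alaskan locality that was mapped to the wrong side of the world.
    INSERT INTO locality VALUES (1, 35.0, 110.0);
    INSERT INTO specimen_event VALUES (1, 1, 'Student', '2010-10-13');
    INSERT INTO specimen_event VALUES (2, 1, 'Student', '2010-10-13');
""")


def edit_locality(locality_id, dec_lat, dec_long, editor, edit_date, push_to_events):
    """Update the coordinates; optionally stamp every linked specimen_event."""
    with conn:  # both updates commit together or not at all
        conn.execute(
            "UPDATE locality SET dec_lat = ?, dec_long = ? WHERE locality_id = ?",
            (dec_lat, dec_long, locality_id),
        )
        if push_to_events:
            conn.execute(
                "UPDATE specimen_event "
                "SET assigned_by_agent = ?, assigned_on_date = ? "
                "WHERE locality_id = ?",
                (editor, edit_date, locality_id),
            )


edit_locality(1, 64.8, -147.7, "Curator", "2015-09-15", push_to_events=True)
events = conn.execute(
    "SELECT assigned_by_agent, assigned_on_date "
    "FROM specimen_event ORDER BY specimen_event_id"
).fetchall()
print(events)
```

Keeping the push behind an explicit flag mirrors the point that fully automatic behavior would need to go through the group first; note that this variant overwrites the earlier agent and date rather than preserving them.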

@DerekSikes

that change sounds great to me.

@mkoo
Member

mkoo commented Sep 16, 2015

This is a long thread (still digesting) but I just want to say that I think Derek pointed out some of our biggest issues with the current model of locality: disassociating the georeferencer versus the updater (usually a curatorial assistant student or curator) vs. the collector. Somewhere along the way to this newer locality model we also made it harder to see the unaccepted coordinates. Tracking history of change is even harder now since it's several clicks away from creating a new specimen event. I still argue that having that versioning of 'georeferences' can be invaluable and nicely mimics legacy curatorial practices of striking out data but not erasing it so future curators can see a history of data updates or fixes.

So in addition to the "push my name+date to specimen event" option, which addresses Derek's important point, a third option could be to create a new specimen event, deprecating the previous one as 'unaccepted' and saving that history.
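
The history-preserving option (add a new specimen event, mark the previous one unaccepted) could be sketched as follows. Everything here is illustrative: the supersede_event function, the simplified schema, and the sample agents are hypothetical, and 'unaccepted place of collection' is assumed as the counterpart of the 'accepted place of collection' event type quoted earlier in the thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE locality (
        locality_id INTEGER PRIMARY KEY,
        dec_lat REAL,
        dec_long REAL
    );
    CREATE TABLE specimen_event (
        specimen_event_id INTEGER PRIMARY KEY,
        specimen_id INTEGER,
        locality_id INTEGER REFERENCES locality(locality_id),
        specimen_event_type TEXT,
        assigned_by_agent TEXT,
        assigned_on_date TEXT
    );
    INSERT INTO locality VALUES (1, 35.0, 110.0);
    INSERT INTO specimen_event
        VALUES (1, 1, 1, 'accepted place of collection', 'Student', '2010-10-13');
""")

ACCEPTED = "accepted place of collection"
UNACCEPTED = "unaccepted place of collection"  # assumed event type name


def supersede_event(specimen_id, dec_lat, dec_long, agent, date):
    """Keep the old event (as unaccepted) and attach a new accepted one."""
    with conn:  # all three statements commit together
        conn.execute(
            "UPDATE specimen_event SET specimen_event_type = ? "
            "WHERE specimen_id = ? AND specimen_event_type = ?",
            (UNACCEPTED, specimen_id, ACCEPTED),
        )
        new_locality_id = conn.execute(
            "INSERT INTO locality (dec_lat, dec_long) VALUES (?, ?)",
            (dec_lat, dec_long),
        ).lastrowid
        conn.execute(
            "INSERT INTO specimen_event "
            "(specimen_id, locality_id, specimen_event_type, "
            " assigned_by_agent, assigned_on_date) VALUES (?, ?, ?, ?, ?)",
            (specimen_id, new_locality_id, ACCEPTED, agent, date),
        )
    return new_locality_id


new_id = supersede_event(1, 64.8, -147.7, "Curator", "2015-09-16")
history = conn.execute(
    "SELECT specimen_event_type, assigned_by_agent "
    "FROM specimen_event WHERE specimen_id = 1 ORDER BY specimen_event_id"
).fetchall()
old_coordinates = conn.execute(
    "SELECT dec_lat, dec_long FROM locality WHERE locality_id = 1"
).fetchone()
print(history, old_coordinates)
```

The old locality row is left untouched, so the earlier coordinates and their assigner remain visible, at the cost of one more locality per correction.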

@dustymc
Contributor Author

dustymc commented Sep 16, 2015

> This is a long thread

tl;dr: so let's build a new model.

> disassociating the georeferencer

NOBODY (should) CARES! The shape/description has something useful to do with a specimen or not. I don't care if someone used a random coordinate generator (map+dart?) and got lucky, ALL that matters is your assertion that a specimen belongs there.

If you insist on caring, then you can't also care (much) about "duplicates" (and near-duplicates), at least not in this model.

> harder to see the unaccepted coordinates

Someone asked for that - figure it out in the group and I can easily turn them back on (in the non-tabular forms - multiple anything-that-doesn't-concatenate remains a problem in tables).

> Tracking history of change is even harder now since it's several clicks away from creating a new specimen event.

But it was impossible in the old model! Unless you're talking about JUST coordinates (eg, 2 of the three dimensions of a shape), which is an extremely limited (and I believe severely abused in the old) use case.

> I still argue that having that versioning of 'georeferences' can be invaluable and nicely mimics legacy curatorial practices of striking out data but not erasing it so future curators can see a history of data updates or fixes.

Old model, you could add coordinates. New model, you can add coordinates - and also change the county while tracking the old. (And deal with depth/elevation.) The new model saves everything the old can, and a lot more, slightly differently, and associates it with specimens in a more-functional way. "Legacy curatorial practices" were developed before GPS was a thing and involved pretending that parasites were parts of hosts, that hosts are just a string in a text field, that cultural collections are not largely made out of things with interesting DNA, and that we'll never encounter an individual twice or send bits to two collections.

> So in addition to push my name+date to specimen event option which addresses Derek's important point a third option could be to create a new specimen event, deprecating the previous one as 'unaccepted' and saving that history.

If it's JUST specimen events you're talking about, I can do that. But you're probably not because nothing is important there - you want the old coordinates/continent/etc., right? See #579 - we can (probably) do that but it's far from trivial.

  • You are not likely to convince me that the old model had any specimen-functionality that's not in the new. It didn't. (Weird hyphen because it DID have better "tracking who put coordinates to names" functionality. That's only half the coordinate/names possibilities and there's no GBIF-for-georeferences (GlobalTWPCTNFacility?), so I still think that's irrelevant-enough.)
  • You could convince me that whodunit is somehow important anyway - fine, let's do it, there's a functional cost (which we're already partially paying), scroll up somewhere.
  • The error logs, discussions, and questions have already convinced me that this model has introduced some significant usability issues. (I think they may be partially expectation-based and if so they're likely to follow us to anything that can deal with parasites and manufacturing origins.)
  • Geography has been hiding specimens for a long time.
  • The current model is a metadata-laden "compromise." The way we're (mis)using those metadata we have to drag around everywhere makes me sad (quick, find never-transcribed/always-digital coordinates).

Let's start blank-slate; tell me what data you have, why you care about it, what you want to do with it, etc., and we'll find a model that does that. (I've got a short list of things that I care about too, but I think they're all pretty trivial/obvious.)

Or if that seems overwhelming we can patch who/when into the current model, but again that does come with a cost in what can be done elsewhere. (And I'm not sure it addresses your concerns??) Maybe the blank-slate approach leads here, maybe not, but it would be good to find out before we end up in some sort of panic situation (which is what led to the current model).

@dustymc
Contributor Author

dustymc commented Sep 17, 2015

> magically change 'specimen_event.assigned_by' to that agent's name & the new date

is implemented w/ https://github.com/ArctosDB/arctos/tree/v7.0.5, leaving this open.

@dustymc
Contributor Author

dustymc commented Oct 3, 2019

#2274

@dustymc dustymc closed this as completed Oct 3, 2019

4 participants