iDigBio Flags on Continent #1291
Comments
@DerekSikes noticed something similar in GBIF regarding dates. Darwin Core is an exchange standard; Arctos isn't "complying" with any data standards because none exist. I agree with your assessment: users' initial reaction to the flag will be "Arctos is broken," which is absolutely not the case. |
We've done a bit of thinking about this internally. Right now there are some data quality flags from iDigBio that are useful because they correct objectively incorrect data, like a mismatch between coordinates and country due to a missing sign. Others, like your Pacific islands example, are subjective and depend on how data are stored in Arctos vs. other models. Many of the objective DQ tests would flag errors that we don't have because Arctos/Dusty also catches them (e.g. "April 31st is a date that doesn't exist"). I don't think the subjective ones are worth our time to care about at this point, in particular because the DQ tests and methods iDigBio uses are in flux due to work being done in TDWG. The TDWG Biodiversity Data Quality task group has a few factions working on different aspects. One is trying to define a framework for what we even mean when we talk about data quality as a collections community. Another is getting all the aggregators, including iDigBio, to settle on a set of the same data quality tests to run on provider data and return flags for. I don't actually think the flags are visibly negative enough to make users think "Arctos is broken." I would hope (although I guess hope is the operative word here) that people who are running analyses on or otherwise using aggregator data for something beyond browsing would notice that the flags are doing more standardizing than correcting, and that obviously different collections/databases use different but equally correct ways to say the same thing... |
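The "April 31st" kind of objective test mentioned above is cheap to run locally. A minimal sketch, assuming dates are already split into year, month, and day; the function name is made up for illustration, and this is not a claim about how Arctos or iDigBio implement their checks:

```python
from datetime import date


def is_real_date(year: int, month: int, day: int) -> bool:
    """Return True when the calendar date exists (hypothetical helper)."""
    try:
        # date() raises ValueError for impossible dates such as April 31st.
        date(year, month, day)
    except ValueError:
        return False
    return True
```

A check like this is objective in the sense used above: no storage model makes April 31st a valid date, so a flag raised from it is a correction rather than a restandardization.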
I think I know what is going on here now, and it would require a change to higher geography. While at SPNHC, Robert Mesibov offered to review some Arctos data for me. He downloaded the MSB fish data from iDigBio and reviewed the RAW file. One of the issues he found was that everything coming from oceans had no water body; instead, the body of water was in the DWC_Continent field. In Darwin Core, Atlantic Ocean is a body of water, not a continent. I thought that it would make sense to call the tectonic plate the "continent", but that isn't how iDigBio does it. They use political boundaries for continent. So DMNS:Bird:18967 in Arctos shows a continent of "Atlantic Ocean" and no associated water body, and DMNS:Bird:18967 in iDigBio shows a continent of "Europe" and has the flag DWC Continent Replaced. Strictly speaking, we are both wrong, but I doubt that anyone searching in iDigBio for Europe wants stuff from the South Georgia Islands. And when I search iDigBio for institution code "DMNS" plus water body "Atlantic Ocean" I get no results. At least anyone searching the Arctos Continent/Ocean field for "Atlantic Ocean" will find this specimen (I tried it and it worked!). All this being said, it seems to me that there needs to be a wider community discussion about Continent and Bodies of Water, but in the interest of making our stuff more searchable in iDigBio (and GBIF, I'm betting), I suggest that we add Water Body to higher geography, and for anything with a continent that is really a water body we add the correct name to the water body field. iDigBio will still replace our "Continent/Ocean" information, but the correct water body will get there, so people searching the oceans will find our stuff. |
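The proposal above amounts to routing a combined Continent/Ocean value into either a continent or a water body on export. A rough sketch of that idea follows; the ocean list and the function name are illustrative assumptions, not Arctos code, and a real list would need to come from whatever authority the group settles on:

```python
# Illustrative list only; not an agreed vocabulary.
OCEANS = {
    "Atlantic Ocean",
    "Pacific Ocean",
    "Indian Ocean",
    "Arctic Ocean",
    "Southern Ocean",
}


def split_continent_ocean(value):
    """Return (continent, water_body) for a combined Continent/Ocean value.

    Oceans go to the water body slot and leave continent empty;
    everything else is passed through as a continent.
    """
    value = (value or "").strip()
    if not value:
        return (None, None)
    if value in OCEANS:
        return (None, value)
    return (value, None)
```

For the DMNS:Bird:18967 example, `split_continent_ocean("Atlantic Ocean")` gives an empty continent and a water body of "Atlantic Ocean", which is what a water body search in an aggregator would need. It does not answer which continent, if any, the record should carry.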
BTW, I added the whole continent/ocean issue to the TDWG data quality GitHub. |
This is an aggregator doing something indefensible (which you've explicitly permitted by licensing your data CC0). This isn't an Arctos issue (there is no standard of which I'm aware), and it's not a DWC issue (the data are being properly transported to the aggregators). There's been a "community discussion" going on for 32 years (this is what TDWG was formed to do) with no resolution. What we NEED is a usable authority. Arctos could become that or plug into something else; both are technically trivial. (What's Kurator using?) I dislike waterbody. I fail to see how the few miles of sometimes-wet sorta-ditch behind the farm (it's in Getty) is the same sort of data as states and counties. |
woops |
@ArctosDB/geo-group , please read John W's response. |
We could (theoretically - it may push this into 'infrastructure-limited' territory) use a non-DWC vocab and translate. E.g., if y'all really like "Central America" as a continent, then we could push it and North America to "North and Central America" on export. (Or maybe that's a horrible idea which just ensures that someone finding something in iDigBio can't find it in Arctos and vice versa.)
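To make the translate-on-export idea concrete, here is a small sketch. The mapping contains only the example from this comment, and both function names are hypothetical:

```python
# Local continent value -> value sent to aggregators (example from the thread).
EXPORT_CONTINENT = {
    "Central America": "North and Central America",
    "North America": "North and Central America",
}


def continent_for_export(local_value):
    """Translate a local continent term; unmapped terms pass through unchanged."""
    return EXPORT_CONTINENT.get(local_value, local_value)


def local_values_for(export_value):
    """List every local term that collapses into one exported term."""
    return sorted(k for k, v in EXPORT_CONTINENT.items() if v == export_value)
```

The second function shows the downside raised above: `local_values_for("North and Central America")` returns two local terms, so someone starting from the aggregator value cannot tell which one to search for locally.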
#1107 - we regularly violate this principle and seem resistant to stopping that.
Maybe that's correct and Oceania only refers to the dirt-parts??
I'd say that's just wrong (and that's why we've added "drainage" and not "waterbody" to the geography table). There's a LOT of stuff in "Cimarron River Drainage" which isn't anywhere near the Cimarron River (or any other water!). And #1366 is still unanswered, but I don't think a pond is included within what we generally see as geography. Maybe that's an indication that trying to draw a line between geography and locality is not a useful thing to do. And I'd like to amend my assertion above: what we NEED is a lookup service which turns shapes into whatever sort of text string anyone might want. (We already have that, but it's not very good, not very structured, and not very exposed - it just supports "any geog" queries, and it does so from points. We also have services to turn strings into coordinates, but that quickly becomes circular - at least sometimes, I'm inclined to support our current model which treats those coordinates as suggestions and relies on a person to accept them as "data.") |
See also tdwg/bdq#172. After looking into this, I have to agree that our current "Higher Geography" is misleading in searches. DMNS:Bird:18967 provides a good example. Its higher geography is: Atlantic Ocean, United Kingdom, South Georgia & South Sandwich Islands, South Georgia Islands, South Georgia. As John W. points out, an island is not part of the ocean (a water body). iDigBio moves this specimen to: If we were following the ISO 3166 codes, we would have a higher geography of:
AN = Antarctica Which makes sense if you are searching by continent or country. ISO 3166 would be far more stable than Wikipedia and we would stop the madness of finding Magellanic Penguins in the United Kingdom (which most certainly happens in Arctos). |
Here's your link - click "requery" on the "show/hide" widget to get a URL. http://arctos.database.museum/SpecimenResults.cfm?scientific_name=Spheniscus%20magellanicus&scientific_name_scope=currentID&scientific_name_match_type=startswith&country=United%20Kingdom I don't really have a problem with those data - the UK is a political entity, not a place. More on that below... I dislike ISO codes as they line up with our data; the intent/meaning is drastically different. We record (sometimes...) what was there when the specimen was collected (or georeferenced, or when the label was printed, or ...), ISO codes refer to something else, those don't always have much to do with each other, and we don't have the resources to update our data when something changes. "Yugoslavia" could refer to lots of shapes (https://www.youtube.com/watch?v=Ic5tBXESxl8) while ISO 3166-1:890 is 1) just https://en.wikipedia.org/wiki/Socialist_Federal_Republic_of_Yugoslavia#/media/File:Yugoslavia_1956-1990.svg, and 2) a withdrawn code.
One problem is that we (and GBIF, apparently) have a crazy mix of geography and politics in the data, and often no way to tell them apart. The UK is most certainly not (entirely) in Europe, nor does the name have any sort of spatiotemporal stability.
That brings up the question of where exactly the island ends and the ocean begins. Mean high tide, the exclusive economic zone (for island nations), some arbitrary point established by some historical event, the place where the collector felt they were no longer close enough to the island to record that, ... ? I'm not sure there's a One True Method for any of that which involves strings. It's all fairly trivial with georeferences - just ask some service capable of responding with the data you want. Theoretically anyway - hard to say what might happen with this input: |
Taxonomy Committee had a brief discussion about this. People searching at VertNet, GBIF and iDigBio will not find some Arctos records due to mismatches between the continents we use in Arctos and those they use (apparently a standard set); see tdwg/dwc-qa#128 (comment). Although it would be a lot of work, I think we need to review all higher geography that uses an ocean as the "continent". As John W. pointed out, Hawaii is not part of the Pacific Ocean (it is not water), and if we are sticking with political divisions for higher geography, then Hawaii should be part of North America. See also #1291 (comment). I also think we should consider how our continents map to those used by the aggregators:
Everything that we have in any of the oceans is likely lost in many searches of aggregators and that could be a lot of things. Actually, I find our continent/ocean list a bit perplexing...why did we decide to make the West Indies a continent? The West Indies is a subregion of North America - https://en.wikipedia.org/wiki/West_Indies How is that any different from "Patagonia"? |
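One way to size the review suggested above is to count records whose continent value falls outside the aggregators' set. The sketch below assumes that set is the seven conventional continent names; the thread only says "apparently a standard set", so treat the list, the record layout, and the function name as assumptions:

```python
from collections import Counter

# Assumed aggregator vocabulary: the seven conventional continent names.
ASSUMED_AGGREGATOR_CONTINENTS = {
    "Africa",
    "Antarctica",
    "Asia",
    "Europe",
    "North America",
    "Oceania",
    "South America",
}


def unmatched_continents(records):
    """Count records per continent value that is not in the assumed set.

    `records` is an iterable of dicts with a "continent" key
    (a hypothetical layout for illustration).
    """
    counts = Counter()
    for record in records:
        value = (record.get("continent") or "").strip()
        if value and value not in ASSUMED_AGGREGATOR_CONTINENTS:
            counts[value] += 1
    return counts
```

Running this over an export would list values such as ocean names or "West Indies" with record counts, which gives a rough idea of how much could be lost in aggregator searches.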
Hi folks, rather than rehash what I think are the issues with how GBIF interprets continent, I urge you to read the issue I presented to them, as it will explain a lot about why you see what you see in GBIF. |
Careful everyone. The VertNet principle of best practice suggests how to do it, it does not say that everyone has done it, or that an assumption to that effect is sage or safe. |
I think that's our primary question here.
Second is how aggregators and other not-us users interpret those data. The easy solution to that is to just share a model. |
To me it needs two parts, the shapes and the thesaurus that connects to it. One could approach geography from the spatio-temporal perspective or from the names perspective. You could do things like: reverse geocoding: Tell me the standard administrative region names for this point (at this time). Here is an example that uses GADM - https://api.gbif-uat.org/v1/geocode/reverse?lat=48.17156&lng=1.18177. get preferred name - I wanna search on the name of a place as I know it and let something translate that into the preferred name used in an index so I get everything I am looking for. This would take a combination of something like TGN (http://www.getty.edu/vow/TGNServlet?english=Y&find=Sudamerica&place=&page=1&nation=), which does have web services now, and an index that actually is standardized against the preferred names. |
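The reverse geocoding call described above can be scripted in a few lines. This sketch only builds the request URL shown in the comment and groups an already-decoded JSON response by layer. The `type` and `title` field names are assumptions about the response shape, and the sample values in the usage note are invented, so check them against a real response before relying on this:

```python
from urllib.parse import urlencode

# Endpoint as given in the comment above (a test/UAT host, so it may change).
GEOCODE_REVERSE = "https://api.gbif-uat.org/v1/geocode/reverse"


def reverse_geocode_url(lat, lng):
    """Build the reverse-geocode URL for a point in decimal degrees."""
    return GEOCODE_REVERSE + "?" + urlencode({"lat": lat, "lng": lng})


def names_by_layer(payload):
    """Group place names by layer from a decoded JSON response.

    Assumes `payload` is a list of dicts carrying "type" and "title"
    keys (an assumption, not confirmed by the thread).
    """
    grouped = {}
    for item in payload:
        layer = item.get("type", "unknown")
        grouped.setdefault(layer, []).append(item.get("title"))
    return grouped
```

For example, `reverse_geocode_url(48.17156, 1.18177)` reproduces the URL quoted above, and `names_by_layer([{"type": "Political", "title": "France"}])` would return `{"Political": ["France"]}` for that made-up payload. Fetching the URL is left out on purpose so the sketch runs without network access.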
Hey, that's pretty cool, thanks! I'll add it to my scripts. Interesting that marineregions.org doesn't seem to have great offshore vocabulary - I'm coming to the idea that there's just no such thing, and trying to fake it (eg, by referring to something dry and far away) only adds to the confusion. https://api.gbif-uat.org/v1/geocode/reverse?lat=38.086621&lng=-122.394955 https://api.gbif-uat.org/v1/geocode/reverse?lat=37.761077&lng=-122.801543 https://api.gbif-uat.org/v1/geocode/reverse?lat=37.382637&lng=-123.419142 https://api.gbif-uat.org/v1/geocode/reverse?lat=29.527412&lng=-138.532940 |
@dustymc |
That's a possibility. I was thinking more radically, but I'm not sure how realistic anything is. If we do something, we'd need to do something consistent. It looks like they end 'continent' right about the Golden Gate Bridge - you OK with that? The Farallons are part of SF County; adopting enough of this would leave us with a transcontinental county, which doesn't seem ideal.
Seems a bit optimistic, but maybe. Would be useful to see their basemap rather than trying to reverse engineer it.
It might - presumably they built this for their own use. |
There are only 3 HG entries with Eurasia, Russia |
Create an Uber-geog level above continent just for Eurasia? |
That won't save you from all the other trans-continental country problems.
See VertNet/DwCVocabs#56.
…On Tue, Sep 1, 2020 at 8:22 PM Mariel Campbell ***@***.***> wrote:
Create an Uber-geog level above continent just for Eurasia?
|
I don't understand why that matters. The most precise information we have doesn't fit into the normal "hierarchy" (it's not, because the world isn't).
|
I agree that this is what we should be doing. The only issue arises when we have a locality = "Russia" (or does it?). In this case, I would suggest that HG = no higher geography and that "Russia" be included in Specific Locality, OR there should be two localities provided, one with HG = Asia, Russia and one with HG = Europe, Russia. |
Also, I can figure out the 3 Russia HG in Eurasia and put them on the appropriate continent. |
I think that's in my "evil" category - it's purposefully "demoting" data to meet our unrealistic expectations.
That works for search, might not be evil, still seems pretty janky to me.
That does not seem possible. One is a country that spans both. One is a former, bigger, country that spans both. One has this: |
I think that using Eurasia is every bit as evil.
Janky, maybe, but it gets the job done (IMO - could be completely wrong).
See first comment above. We have "Asia, Russia" and "Europe, Russia". Assign two events with both localities to the records that use "Eurasia, Russia". BTW, I think some of these could have more appropriate HG
Aren't we supposed to be using "current" HG? Some of these could be made better, and for the rest "no higher geography" with Soviet Union in the spec loc seems not so evil, since they are just that vague anyway.
See fix as applied to "Russia". Also, pretty sure these could be sorted onto the correct continent, since they have coordinates... |
It would be nice to have a bit of wiggle room so our coordinates could be 100' off shore and not create an out-of-bounds, but if we had EEZs to work with right off the bridge, it would probably be ok. This issue has gained a lot of Where's Russia? influence so maybe the rest of this comment belongs elsewhere, but it's related to the question of how to deal with offshore locations. A consortium of museums (I don't think any are in Arctos) recently received a grant https://www.nsf.gov/awardsearch/showAward?AWD_ID=2001510&HistoricalAwards=false that is focused on geolocating specimens on the US eastern seaboard. Here is part of their proposal: This project will generate reliable geo-coordinate data for all covered specimen lots using a collaborative georeferencing project in GeoLocate. GeoLocate will add layers for bathymetric data, benthic habitat, and marine conservation areas. Incorporating bathymetry into GeoLocate to determine the extent of locations will also provide that capability for complex elevational data for terrestrial species....The data will be shared through public data repositories, including iDigBio, GBIF, OBIS, and the InvertEBase Symbiota portal. I asked Dr. José Leal at the National Shell Museum, one of the participants, if, in addition to geolocating specimens more precisely, the project would result in a marine locality structure that could be used by other museums with specimens from similar locations. His reply: Yes, that is the idea. We have Nelson Rios from Geolocate as a PI in the grant, so some of the more technical questions will be resolved by him on this. For marine localities we'll be adding station coordinates (which is nothing new), but still need to resolve how to handle "stations" without coordinates ("off Cape Sable", etc.) Not sure there's anything in the work they are doing that will be helpful for us, but I thought I'd add it to the stew just in case. |
Taken to extremes, would that require a "France, 1800" record to have about 80 determinations?
That idea died an agonizing death under the pressure of reality; it's a nice ideal, but it would require a tremendous amount of work every time someone moves a border.
It's less vague than the alternatives.
It does not involve discarding data, so I have to disagree. Splitting Sverdlovsk Oblast or San Francisco County across two made-up pigeonholes doesn't seem terribly conducive to discovery, nor does dumping Norway and India into one made-up pigeonhole. I have no idea what we should do, but I do not think it will involve removing precision at any scale. |
Closing as we are not addressing the original issue. |
As my data was recently ingested by iDigBio, I received a huge list of specimens flagged for various corrections (sigh). I wanted to bring this one to the group to see if we should be paying more attention to Darwin Core, or if it is just something to let iDigBio keep "correcting" for.
Some of my specimens on islands in the Pacific are flagged by iDigBio with "dwc_continent_replaced | Darwin Core Continent Corrected." One example is here:
https://www.idigbio.org/portal/records/89015b8e-d745-430c-b846-8b250b62afcb
Is Arctos not complying with Darwin Core or is this just an artifact of iDigBio? Do we need to do anything about it or do I just need to know that these flags are not a problem? My main concern is that users of iDigBio will view our data as less reliable with flags attached.