Geography Proposal #3272
Comments
Issues meeting:
has potential, implement, gather some data, expose internally and in limited scope (eg, from higher geog edit page), then analyze and decide how to proceed AWG: Go |
This is a goldmine. I am going to blithely steal from it as I work on the Locality Services. |
Built it, and we shall steal.... |
Seems fair.
…On Thu, Dec 3, 2020 at 4:35 PM dustymc ***@***.***> wrote:
as I work on the Locality Services.
Built it, and we shall steal....
|
I went with a fairly-normalized model; it should be pretty easy to shuffle things around if it causes some sort of problem.
create table place_terms (
place_term_id serial not null,
locality_id bigint references locality(locality_id) on delete cascade,
term_type varchar not null,
term_value varchar not null,
source varchar not null,
last_date date default current_date
);
It's talking to Google, and keeping only
administrative_area_level_1,administrative_area_level_2,administrative_area_level_3,country
which are the only "geography-like" terms I could find in that particular API. That's easy to adjust if someone wants something else; Google seems to know a lot about rooftops...
Plugging in to other APIs should be trivial, so if anyone knows of anything that'll take coordinates and return something that someone might consider geography, please let me know about it.
http://test.arctos.database.museum/place.cfm?action=detail&locality_id=1178173 looks like....
[image: Screen Shot 2020-12-08 at 4 45 25 PM]
It would be pretty easy to use those terms and/or ranks in search, assert them instead of or alongside "curatorial geography," or whatever turns out to be handy.
It won't be very interesting until some data are gathered. @mkoo if we have the bandwidth I could temporarily be more aggressive with the cacher after this goes to production, which might happen in a couple hours. |
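For anyone wanting to poke at the place_terms shape from this comment, here is a minimal sketch using Python's built-in sqlite3 module. It is an adaptation, not the production code: the thread's DDL uses `serial` and `bigint`, so the column types are changed for SQLite, the `locality` table is a one-column stub, and the `store_terms` helper and the sample values are hypothetical.

```python
import sqlite3

# SQLite adaptation of the place_terms DDL quoted in this thread.
# The locality table is a stub so the foreign key has something to point at.
SCHEMA = """
create table locality (locality_id integer primary key);
create table place_terms (
    place_term_id integer primary key autoincrement,
    locality_id integer references locality(locality_id) on delete cascade,
    term_type text not null,
    term_value text not null,
    source text not null,
    last_date text default current_date
);
"""


def store_terms(conn, locality_id, source, terms):
    """Insert one row per (term_type, term_value) pair for a locality."""
    conn.executemany(
        "insert into place_terms (locality_id, term_type, term_value, source) "
        "values (?, ?, ?, ?)",
        [(locality_id, term_type, value, source) for term_type, value in terms],
    )


conn = sqlite3.connect(":memory:")
conn.execute("pragma foreign_keys = on")  # SQLite needs this for the cascade
conn.executescript(SCHEMA)
conn.execute("insert into locality (locality_id) values (1178173)")

# Hypothetical values; only the term types come from the thread.
store_terms(conn, 1178173, "Google", [
    ("country", "United States"),
    ("administrative_area_level_1", "Alaska"),
    ("administrative_area_level_2", "North Slope"),
])

rows = conn.execute(
    "select term_type, term_value from place_terms "
    "where locality_id = ? order by place_term_id",
    (1178173,),
).fetchall()

# Deleting the locality removes its cached terms through the cascade.
conn.execute("delete from locality where locality_id = 1178173")
remaining = conn.execute("select count(*) from place_terms").fetchone()[0]
```

Because each term is its own row, adding a new term type or a new service needs no schema change, which seems to be the flexibility the normalized model is after.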
GBIF has made a reverse geocoding API available that uses GADM and
marineregions.org EEZs
The code is here:
https://github.com/gbif/geocode
And here is an example API call:
http://api.gbif.org/v1/geocode/reverse?lat=-41.0570673&lng=-71.5268821
In the response, if a distance is non-zero, then it is the minimum distance
in degrees to that administrative division.
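A small sketch of how that call and the distance rule might be handled in Python. The endpoint and parameters are the ones in the example call above; the record shape (`type`, `name`, `distance` keys) and the sample values are assumptions for illustration, and nothing here touches the network.

```python
from urllib.parse import urlencode

# Endpoint taken from the example API call in this comment.
GBIF_REVERSE = "http://api.gbif.org/v1/geocode/reverse"


def reverse_url(lat, lng):
    """Build the reverse-geocode URL for a coordinate pair."""
    return f"{GBIF_REVERSE}?{urlencode({'lat': lat, 'lng': lng})}"


def describe(record):
    """Summarize one response record using the distance rule described above."""
    distance = record.get("distance", 0)
    if distance == 0:
        return f"{record['type']} {record['name']}: contains the point"
    return f"{record['type']} {record['name']}: minimum distance {distance} degrees"


# Hypothetical records; the key names are assumed, not confirmed by the thread.
sample = [
    {"type": "GADM0", "name": "Argentina", "distance": 0.0},
    {"type": "GADM0", "name": "Chile", "distance": 0.35},
]

url = reverse_url(-41.0570673, -71.5268821)
lines = [describe(r) for r in sample]
```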
…On Tue, Dec 8, 2020 at 9:50 PM dustymc ***@***.***> wrote:
I went with a fairly-normalized model, should be pretty easy to shuffle
things around if it causes some sort of problem.
create table place_terms (
place_term_id serial not null,
locality_id bigint references locality(locality_id) on delete cascade,
term_type varchar not null,
term_value varchar not null,
source varchar not null,
last_date date default current_date
);
It's talking to Google, and keeping only
administrative_area_level_1,administrative_area_level_2,administrative_area_level_3,country
which are the only "geography-like" terms I could find in that particular
API. That's easy to adjust if someone wants something else; Google seems to
know a lot about rooftops...
Plugging in to other APIs should be trivial, so if anyone knows of
anything that'll take coordinates and return something that someone might
consider geography, please let me know about it.
http://test.arctos.database.museum/place.cfm?action=detail&locality_id=1178173
looks like....
[image: Screen Shot 2020-12-08 at 4 45 25 PM]
<https://user-images.githubusercontent.com/5720791/101558916-d35cdc80-3974-11eb-841b-541c13cbdc57.png>
It would be pretty easy to use those terms and/or ranks in search, assert
them instead of or alongside "curatorial geography," or whatever turns out
to be handy.
It won't be very interesting until some data are gathered. @mkoo
<https://github.com/mkoo> if we have the bandwidth I could temporarily be
more aggressive with the cacher after this goes to production, which might
happen in a couple hours.
|
For the followup of making generated coordinates more visible, there's a new operator button on specimen detail for no-coordinate events. Two clicks... ...and... ... happens. It's not a great georeference - there is no error calculation - but I've clicked the button perhaps 50 times and nothing meaningfully "wrong" has happened. (Maybe I'm bad at picking test cases!) There is a map available before the second click, should anyone want to review it before clicking - this is simply a new path to an old tool. The georeference will need further work to be suitable for all use cases, but it also makes the record available to spatial tools where it can be more efficiently improved; even horribly incorrect georeferences seem like an improvement from that perspective. I'd be happy to talk about further lowering the bar, should anyone or everyone want magical coordinates without the clicking. |
I would reject everything with distance >0. Those are near neighbors in
case the geometry is vague or if you want to apply admin values to
near-offshore coordinates. But I wouldn't want to propagate false
positives. But maybe I just like things too simple.
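The "reject everything with distance >0" rule could be sketched like this in Python. The record keys (`type`, `name`, `distance`), the helper names, and the sample data are assumptions for illustration; records with no distance at all are also rejected here, which is a design choice rather than anything stated in the thread.

```python
def accept_exact(records):
    """Keep only divisions reported at distance 0; near neighbors and
    records without a distance are dropped."""
    return [r for r in records if r.get("distance") == 0]


def to_terms(records, source):
    """Turn accepted records into (term_type, term_value, source) tuples."""
    return [(r["type"], r["name"], source) for r in records]


# Hypothetical service response for one coordinate.
response = [
    {"type": "GADM0", "name": "United States", "distance": 0.0},
    {"type": "GADM1", "name": "Alaska", "distance": 0.0},
    {"type": "GADM0", "name": "Canada", "distance": 1.2},  # near neighbor
]

terms = to_terms(accept_exact(response), "GBIF")
```

The trade-off is the one raised above: near-offshore coordinates get no administrative terms at all, but no neighboring division is asserted by mistake.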
…On Wed, Dec 9, 2020 at 6:46 PM dustymc ***@***.***> wrote:
Thx - I did eventually remember that...
[image: Screen Shot 2020-12-09 at 1 41 01 PM]
<https://user-images.githubusercontent.com/5720791/101691896-3a859a00-3a24-11eb-8216-74be77d20ecc.png>
I've got it set to grab everything for now - I suspect we'll end up
filtering and deleting some stuff at some point. Given the (vague and
potential) intent of this, perhaps it's best to preemptively reject
everything with distance>0?
|
Sounds scary, but I guess we could give it a try.... Done, in production, cache-checker-thingee is running a little harder than normal @mkoo |
This has processed ~20K localities so far, there's perhaps enough data for patterns to begin emerging. https://arctos.database.museum/place.cfm?action=detail&locality_id=1116141 had just finished when I checked in, seems fairly normal. Locality terms:
GBIF:GADM0,GBIF:GADM1,GBIF:GADM2 pretty consistently form country:state:province; they seem like a suitable solution to #3186. Google:country,Google:administrative_area_level_1,Google:administrative_area_level_2 could serve the same purpose. [Dis]agreement between those things could be a useful metric.
This does seem capable of providing a consistent, limited set of search parameters which will return ALL (or the 97% I can get coordinates for) items from a placename.
The "all localities" report @mkoo asked for is a decent reflection of the all-localities map.
They're both all over the place; they might be useful for demonstrating that we need funding to resolve #1679, but they're not useful for addressing spatial questions.
There's some limited oceanic data in GBIF - https://arctos.database.museum/place.cfm?action=detail&locality_id=80080 is the first "mostly wet" locality I stumbled across, and the service seems to be at least as useful as the asserted data. I think the important point for this is that figuring out marine things isn't an Arctos problem under this model, it's a community problem. If GBIF (who certainly has far more resources than Arctos) does something clever it'll magically find its way in to Arctos; if someone else does something we should be able to plug in to their API.
@sharpphyl This seems to be working far better than I'd expected. I suggest we begin thinking about how to make it available in the UIs, how to distinguish it from "curatorial geography," and perhaps even how to share it back to GBIF via DWC (which should stop the flagging that seems to annoy some users). |
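One way the [dis]agreement between the GBIF and Google terms mentioned in this comment might be turned into a number, as a rough Python sketch. The pairing of ranks in `RANK_PAIRS` is an assumption (the comment only says the two sets could serve the same purpose), and the function name and sample values are hypothetical.

```python
# Assumed pairing of ranks between the two services.
RANK_PAIRS = [
    ("GADM0", "country"),
    ("GADM1", "administrative_area_level_1"),
    ("GADM2", "administrative_area_level_2"),
]


def agreement(gbif_terms, google_terms):
    """Return (matches, compared), counting only ranks where both
    services supplied a value."""
    matches = compared = 0
    for gbif_rank, google_rank in RANK_PAIRS:
        a = gbif_terms.get(gbif_rank)
        b = google_terms.get(google_rank)
        if a is None or b is None:
            continue
        compared += 1
        if a.strip().casefold() == b.strip().casefold():
            matches += 1
    return matches, compared


# Hypothetical terms for a single locality.
gbif = {"GADM0": "United States", "GADM1": "Alaska", "GADM2": "North Slope"}
google = {
    "country": "United States",
    "administrative_area_level_1": "Alaska",
    "administrative_area_level_2": "North Slope Borough",
}

score = agreement(gbif, google)
```

The third pair is a reminder that a plain string comparison counts "North Slope" and "North Slope Borough" as a disagreement, so some name normalization would probably be needed before the metric means much.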
https://arctos.database.museum/place.cfm?action=detail&locality_id=10824871 is interesting. There's no WKT for the drainage-in-county. Without something like #3108 (which would get at "in county" but not "in drainage") it's difficult to say if the coordinates are reasonable or not. GBIF is returning "Bernalillo" for GADM2, strongly suggesting that the coordinate/curatorial geography alignment is in fact not reasonable. While not a replacement for better WKT, this looks like it will expose useful ways of detecting low-quality data. |
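The kind of check described here could look roughly like the Python sketch below: flag a locality when the service-derived county does not occur anywhere in the curatorial higher geography string. The function name and the curatorial string are hypothetical (only "Bernalillo" comes from this comment), and a substring test is a blunt instrument that will miss spelling variants.

```python
def county_conflict(higher_geog, service_county):
    """Return True when the service-derived county name does not occur
    anywhere in the curatorial higher geography string."""
    return service_county.strip().casefold() not in higher_geog.casefold()


# Hypothetical curatorial string for illustration.
curatorial = (
    "North America, United States, New Mexico, "
    "Example County, Example Drainage"
)

flagged = county_conflict(curatorial, "Bernalillo")
```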
Nice. Re: standardized place name - It isn't really the place name, it is the geography, right? Why smaller? Maybe there is some other way to separate it, or just call it "Service Asserted Geography"? Also, how about a "more" link to that? Possible? Maybe "Higher Geography" should be titled "Curatorial Asserted Higher Geography"? Or maybe we just need a section here that is "Curatorial Asserted" and another that is "Service Asserted" or something like that. |
For now - yea, more or less, I think, whatever that means..... Potentially, it's whatever we find at some place - certainly marine (no geo) stuff, maybe there's something cool in Google's rooftop data, whatever. I'm struggling to find a name that might accommodate that, suggestions greatly appreciated.
There are 2 in the area that will get you there. The one with locality is the more relevant, that may or may not say something useful about the label.
That's what it IS in my view, but we use higher_geog[raphy] in many places, and I don't want this to turn in to something that someone finds offensive - I think that might be a little overly aggressive.
It's "Service-Derived" in /place - "Asserted" might be better - accurate, but does everyone know what that means? |
Those "more" take you to things that are more of those. This is probably a bad example because the HG and the "Standardized Place Name" are essentially the same, but if the SPN was different from the HG, then I would assume that "more" would be a different set of stuff - No?
Verbatim?
Service Derived seems good. |
See https://arctos.database.museum/info/reviewAnnotation.cfm?ANNOTATION_GROUP_ID=37714 The webservice data is pulling in a nearby county, in this case unnecessarily/incorrectly. Can/should we do anything about that? |
This is now searchable in https://arctos.database.museum/SpecimenSearch.cfm |
How the webservices work is changing a bit (unless @mkoo has a dramatic change of heart!). This is running in test, will probably be in production tonight. The data will take some time to catch up.
GeoLocate is now the primary source of coordinate-from-text data, and it generally returns NULL (translation: "I have no idea what you're talking about") for variations of
"most precise available term from geography" is currently
I am now being more explicit in source.
The locality detail page now looks like...
note "asserted" (from curatorially-supplied coordinates) and "derived" (from coordinates I've produced from the text data).
The catalog record now looks like...
I don't think any of this is incompatible with idea of "categorizing" localities (from a couple comments up); that would add another dimension on what we can use to detect conflicting data, and would still be useful (eg in ignoring terrestrial, overly precise, whatever terms) if we do want to assert a "standardized" place name at some point. |
I'm closing this. We're pulling in standardized geography data and it's available for search; it is not completely impossible to predictably find things by geography terms in Arctos, that is always the primary goal. If a collection wants to take that farther, a new Issue can be opened. |
Background
I can see no evidence that the recent efforts in geography cleanup have resulted in more discoverable catalog record data, which I presume to be a core use case for maintaining geography. It's still possible for data entry personnel to assign arbitrary geography to records, and without consistency predictable geography text search results are not possible. See #3249 for example.
#3186 is a proposal to find more consistency in these data, but it will result in significantly reduced functionality in several areas. I don't see this as an acceptable tradeoff, and I don't think Curators will or should either.
Our current geography model does offer several valuable tools for georeferencing and confirming that georeferences fall within specified geography areas, but this still does not provide a consistent mechanism for locating cataloged records by geography.
Arctos has for some time been using various webservices to find coordinates for records without them, and to associate coordinates (both asserted and derived) with place names from various webservices. This is useful for search, but there is no formality or consistency in these data; they're just search strings.
Proposal
Retain the existing geography model, which allows "traditional" curatorial assertions (which support various internal functions - organizing material by Quad, for example).
Split the derived geography out into a separate, structured, formal table. This would allow consistent searching - all records from http://www.geonames.org/5880054/barrow.html would be discoverable as "United States","Alaska" and "North Slope" for example. For contrast, current data would require somewhere between three and 16 queries (depending on level) to find the desired "Barrow-ish" records.
Implications
This would immediately result in more discoverable (by virtue of consistency) data in Arctos. One query - rather than the up to 16 currently required - would find records from Barrow.
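To make the "one query" claim concrete, here is a sketch against a simplified, hypothetical version of the structured table, using Python's sqlite3. The table layout, the term type names ("country", "state", "county"), the `find_localities` helper, and the rows are all assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table place_terms (locality_id integer, term_type text, "
    "term_value text, source text)"
)
conn.executemany("insert into place_terms values (?, ?, ?, ?)", [
    (1, "country", "United States", "example"),
    (1, "state", "Alaska", "example"),
    (1, "county", "North Slope", "example"),
    (2, "country", "United States", "example"),
    (2, "state", "Alaska", "example"),
    (2, "county", "Nome", "example"),
])


def find_localities(conn, wanted):
    """Return locality_ids carrying every (term_type, term_value) in wanted."""
    if not wanted:
        return []
    clause = " or ".join(["(term_type = ? and term_value = ?)"] * len(wanted))
    params = [x for pair in wanted for x in pair]
    sql = (
        f"select locality_id from place_terms where {clause} "
        "group by locality_id having count(distinct term_type) = ? "
        "order by locality_id"
    )
    return [row[0] for row in conn.execute(sql, params + [len(wanted)])]


barrow_ish = find_localities(conn, [
    ("country", "United States"),
    ("state", "Alaska"),
    ("county", "North Slope"),
])
```

The query only works because every locality is described with the same small vocabulary of terms, which is the consistency the proposal is asking for.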
Longer term, we could discuss making these data more visible, perhaps sharing them via DWC, etc. This is essentially an implementation of #3186 but as an enhancement rather than a replacement.
This approach also has significant future-proof qualities. A county's new name will become available for searching as soon as it's entered into a service we use, with no curatorial work involved. Using a new/better/specialized service would be a matter of making Arctos aware of it.
No changes would be required to catalog new material.
Future changes to "curatorial geography" would not be so wide-ranging; we might be able to more readily accommodate curatorial needs without reducing functionality to users.
In short, I think this would result in drastically more discoverable data with no additional curatorial work, and without asking Curators to give up anything. It would also retain all of the work we've put into cleaning and organizing geography.
Followup
This approach would rely on coordinates to retrieve the consistent geography data, and so I also propose that we make the derived coordinates more visible, and more available to collections who wish to use them, as an immediate followup. It would be trivial to create a georeferenced Specimen Event for cataloged records without one, for example. This would not be a particularly "good" georeference, but it would make any problems much more discoverable by providing a path to spatial tools, and could be flagged as automation in various ways (a new value in https://arctos.database.museum/info/ctDocumentation.cfm?table=ctverificationstatus is perhaps most "filter-able").
For scale, Arctos currently holds 688778 localities, 496467 (72%) of which have curatorial coordinate assertions. 668709 (97%) have service-derived coordinate assertions.
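The percentages can be checked from the counts given above; a trivial Python sketch (the variable and function names are just for illustration):

```python
# Counts quoted in the paragraph above.
total = 688778
curatorial = 496467
service_derived = 668709


def percent(part, whole):
    """Share of whole, rounded to the nearest whole percent."""
    return round(100 * part / whole)


curatorial_pct = percent(curatorial, total)
service_pct = percent(service_derived, total)
```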
Related Issues
In no particular order. I got overwhelmed and gave up trying to better organize these, you can too! There are a few "themes" in these, but they're often broad and intermingled.
Some Issues are incorporated in this proposal. There's nothing new here, it's just a no-compromises merger of existing ideas. Restructuring geography, incorporating various Standards and Services, and being a more involved member of the larger community are inevitable, for example.
Some Issues become less important if not irrelevant under this proposal. Choosing curatorial functionality over discovery has little impact with this 2-part approach. Inconsistent data has a much shorter reach. Using "modern" geography is not as pressing, perhaps not even desirable. A universal definition of geography, or a shared idea of its goals, is no longer necessary.
Some Issues change very little, or not at all, under this. Adding spatial data will enable the same awesomeness under this proposal, for example.
Structure
Table formal_geography could take two general shapes.
A normalized structure would provide more flexibility, but is more difficult and expensive to query. It would support any number of terms of any rank (including none), and generally be more capable of representing whatever comes in from Services (including that cool new thing which hasn't been built yet). It would also be expensive to query, difficult to access, impractical to flatten, and perhaps difficult to "translate" (eg, we end up with 12 ways of saying "country" from various sources).
A more flattened approach would serve the core use case of discoverability, could be treated like a spreadsheet for various purposes, but would not be completely faithful to service data.
Both would require some way to tie to "core" or "curatorial" data (probably Locality). A linking table would provide a mechanism to tie many assertions to a locality, which seems necessary, and a mechanism to tie many localities to an assertion (which could reduce the data we must store, but I don't anticipate using this direction).
An alternate would be adding
locality_id fkey-->locality
directly into the formal_geography table, which might make sense with the flatter version.
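The trade-off between the two shapes can be illustrated with a short Python sketch that pivots normalized rows into a flatter record. Everything named here is hypothetical: the row layout, the `COLUMN_FOR` mapping, the flat column names, and the sample data. The mapping stands in for the "translation" step the normalized shape would need.

```python
from collections import defaultdict

# Hypothetical mapping from service-specific term types to flat column names.
COLUMN_FOR = {
    ("GBIF", "GADM0"): "country",
    ("Google", "country"): "country",
    ("GBIF", "GADM1"): "state_prov",
    ("Google", "administrative_area_level_1"): "state_prov",
}


def flatten(rows):
    """Pivot (locality_id, source, term_type, term_value) rows into one dict
    per (locality_id, source); unmapped term types are kept, not dropped."""
    flat = defaultdict(lambda: {"unmapped": []})
    for locality_id, source, term_type, term_value in rows:
        record = flat[(locality_id, source)]
        column = COLUMN_FOR.get((source, term_type))
        if column is None:
            record["unmapped"].append((term_type, term_value))
        else:
            record[column] = term_value
    return dict(flat)


# Hypothetical normalized rows for one locality.
rows = [
    (1, "GBIF", "GADM0", "United States"),
    (1, "GBIF", "GADM1", "Alaska"),
    (1, "GBIF", "GADM2", "North Slope"),
    (1, "Google", "country", "United States"),
]

flat = flatten(rows)
```

The "unmapped" list shows where a flat table stops being faithful to the service data: anything without a column has to be dropped or parked somewhere else.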