higher geography lookup is slow #2874
Comments
@dustymc Dusty, I'm for some reason unable to add labels to issues. Did something change in permissions or something? |
@mkoo @Jegelewicz should Arctos Users have Write on ArctosDB/Arctos or is that some other Team (which @mvzhuang should be a part of)? https://github.com/orgs/ArctosDB/teams/arctos-users/repositories |
Vicky, see if you can now!
|
Yay labels are fixed for me! Thanks! |
OK, then it's fixed for the Arctos Users group! |
Yes labels work, but @dustymc still needs to resolve the issue.... |
The original issue is fixed, but stripGeogRanks isn't performing adequately, and it's going to take some time to somehow address that. Needs prioritized. |
Looks like PG's generated columns would serve this purpose, but that only exists in PG12 and my test box is PG11. |
Blocked by https://github.com/ArctosDB/internal/issues/65, going back to needs discussion |
Played with this some more. The issue seems to be that geography has grown a great deal, largely from the addition of "subquad" data in quad, and partly from e.g. #1278 ("minor" features are treated as geography). I've reduced the defaults on the form so it's more functional, but it remains slow, albeit still probably orders of magnitude faster than not having the form. Two obvious possibilities:
|
@dustymc is this only an issue for the various components? So if I use option 2 and the strings I enter are only compared to the concatenated higher geog strings, would that be less problematic? |
I'm not sure, it probably is faster, but it's also a LOT less likely to figure things out when comparing big disorganized strings. |
@dustymc Maybe we make the first step "is this string there?" So, when I have
and that is already there - no further work is required, just say "in Arctos". If it isn't there, just say "FAIL" kinda the way the taxonomy name checker works. What this thing is currently doing is not going to be useful in any big set of data. I have 39 HGs and it returns them 2 at a time after about 5 minutes of processing - that means hitting refresh 20 times and waiting 100 minutes! |
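The "is this string there?" first pass proposed above could look roughly like the sketch below. It is a minimal Python illustration, not Arctos code: `known_geog` stands in for the higher geography strings pulled from geog_auth_rec, and the function name `check_higher_geog` is hypothetical.

```python
def check_higher_geog(incoming, known_geog):
    """Report 'in Arctos' for exact matches and 'FAIL' for everything else.

    known_geog is assumed to be an iterable of higher geography strings,
    for example an export of geog_auth_rec. Nothing here is the real tool.
    """
    known = set(known_geog)  # build once so each membership test is cheap
    return {value: ("in Arctos" if value in known else "FAIL") for value in incoming}


# Example: one string that exists as-is and one with a misspelling.
known = ["North America, United States, Wyoming, Park County"]
result = check_higher_geog(
    [
        "North America, United States, Wyoming, Park County",
        "North America, United States, Wyomin, Park County",
    ],
    known,
)
```

Because this pass only tests set membership, it would likely stay fast for a few dozen strings; anything marked FAIL could then go on to the slower suggestion step.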
I mean, I see the misspelling in California - why is Tehama County the problem? |
You can probably just pull table geog_auth_rec for now - or not, I'm not sure, I can get it out if you can't.
Type to pick - it's suggesting what it knows (or choking in the attempt, or something).
I've cleaned a couple million records with it, but yea it's not ideal like it is. First question is whether we bother trying (and continue failing) to standardize geography at all. If we do, then we need to decide what "geography" means - the bajillion not-quite-quads (and waterbodies and maybe other stuff) are pluggin' the toobs, so we move them, or do a better job of organizing them, or cache more aggressively, or SOMETHING. If we get through all that, the "component loader" model (or something like it) does a good job of dealing with limited processors. |
Merging #1105 here - if we keep this, these need to be added to stripGeogRanks
|
@dustymc can we please make this better? See https://github.com/ArctosDB/data-migration/issues/1147 |
Yep, the component loader ecosystem gets around my problems, I'll go next task. |
Next release. Even the component loader wasn't able to handle the function-manipulated data at a reasonable rate, so I rebuilt stripGeogRanks and added generated stripped_{field} terms to geog_auth_rec. It's some junk to store, but I think we can afford that (it's tiny compared to spatial data) and processing is now reasonably fast. The loader returns up to 10 possible matches, and a status value that will hopefully help sort them out. "Just use the first" is probably a mostly-sorta-defensible position for e.g. an incoming collection - it likely won't be WRONG most of the time, but it will probably not be of quite the right precision for lots of data. @Jegelewicz (or anybody else) if you've got any "raw" data - the uglier the better - please pass it along, there's room for lots of tuning. |
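For readers following along, the general idea of storing a stripped form once and matching against it can be sketched in Python as below. This is an illustration under assumptions, not the actual implementation: the rank-word list, the function names (`strip_ranks`, `build_index`, `lookup`), and the status strings are all hypothetical, since the real stripGeogRanks logic and the loader's status values are not shown in this thread.

```python
import re

# Hypothetical rank words; the real stripGeogRanks list is not shown in this thread.
RANK_WORDS = ("county", "parish", "province", "state", "district", "department")
_RANK_RE = re.compile(r"\b(?:" + "|".join(RANK_WORDS) + r")\b")


def strip_ranks(value):
    """Lowercase, drop rank words and punctuation, and collapse whitespace."""
    text = _RANK_RE.sub(" ", value.lower())
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


def build_index(records):
    """Compute the stripped form once per record, like a stored generated column."""
    index = {}
    for rec in records:
        index.setdefault(strip_ranks(rec), []).append(rec)
    return index


def lookup(raw, index, limit=10):
    """Return up to `limit` (status, record) pairs, exact matches first."""
    matches = index.get(strip_ranks(raw), [])
    ranked = sorted(matches, key=lambda rec: rec != raw)  # False sorts before True
    return [("exact" if rec == raw else "stripped_match", rec) for rec in ranked[:limit]]


index = build_index(["North America, United States, Wyoming, Park County"])
hits = lookup("north america, united states, wyoming, park", index)
```

The point of the sketch is the cost model: the stripping work is done once when the index is built, so each incoming string needs only one strip and one dictionary lookup instead of re-running the function over every stored row.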
try this |
thx, script is a little smarter than it used to be. |
Betta, but what the heck? Shouldn't North America, United States, Texas, Aransas County also appear here? Also, can the first column hold the closest match?
North America, United States, Wyoming, Park County exists - the other stuff is nice, but knowing there is an exact match is task number one and the exact match didn't even make the list? |
Issue Documentation is http://handbook.arctosdb.org/how_to/How-to-Use-Issues-in-Arctos.html
Describe the bug
higher geography lookup cleaning tool isn't working
To Reproduce
uploaded higher geography lookup for data cleaning and getting this error
Tried it with old files that worked before and it's still throwing the same error
http://arctos.database.museum/DataServices/geog_lookup.cfm?action=validate
Expected behavior
for selection of higher geography to show up
Screenshots
Data
If this involves external data, attach the actual data that caused the problem. Do not attach a transformation or subset. You may ZIP most formats to attach, or request a Box email address for very large files.
Desktop (please complete the following information):
Additional context
Add any other context about the problem here.
highergeog.xlsx
Priority
Github isn't letting me choose a label right now...