Agents - Need help with preparing agents for incoming collections #4526
Add new name as Michael Petersen Xxxx Collection bulkloaded agent for
preferred name?
|
This is the balance between the ease of creation and the ease of merge I bring up from time to time. If mergers are to be difficult, you probably really need to spend a week (month, year....) on this, and I doubt anyone wants to subject themselves to any aspect of that. Given that block, I remain a huge fan of marking anything that isn't obviously different to merge, and encouraging everyone with agent access to do the same.
I think that's a little too coarse, but (under the viewpoint I laid out above) I wouldn't spend too much time trying to figure it out. Aggressively "merging" in the pre-load stage - strongly preferring a sorta-similar existing Agent - might be a viable approach as well, as long as the verbatim gets included with the records. (Burke took that to extremes and didn't initially load collector-agents at all; they might have a useful viewpoint on this.) |
I don't know, wouldn't that tell people we give everyone a new name every time they move to a new collection? Also, what is the "Xxxx" standing in for?
Does this mean finding as many people that are maybe the same as someone else and calling them the same person before you enter your first batch of names? |
I'm in favor of loading (cleaned) names and reviewing later. Perhaps a better way of flagging names could help: a flag that is deliberately placed (not computer generated), that doesn't result in agent merger in two weeks, and Arctos sends data quality contacts summary emails of their agents with these flags. So for example, UWZM:Bird loads their agents. Whoever uploads the data flags the agents that came up as "needs review" (or any others they feel they want to review). When their data is loaded they get one monthly summary email: hey, you flagged this list of agents, go review them. |
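The flag-then-summarize workflow described above could be sketched roughly as below. The "needs review" flag value, the record shape, and the contact addresses are hypothetical stand-ins for illustration, not Arctos's actual schema or notification system:

```python
from collections import defaultdict

def summarize_flagged_agents(agents):
    """Group flagged agent names by data quality contact, producing one
    summary message per contact (e.g. for a monthly email digest)."""
    by_contact = defaultdict(list)
    for name, flag, contact in agents:
        if flag == "needs review":
            by_contact[contact].append(name)
    return {
        contact: "You flagged these agents for review: " + ", ".join(sorted(names))
        for contact, names in by_contact.items()
    }

# Hypothetical (name, flag, contact) records from a bulkload.
agents = [
    ("Michael Petersen", "needs review", "uwzm-birds@example.org"),
    ("Michael K. Petersen", "needs review", "uwzm-birds@example.org"),
    ("Teresa Mayfield-Meyer", None, "uwzm-birds@example.org"),
]
```

Anything not flagged at load time simply never appears in a digest, so the flag stays a deliberate human act rather than a computer-generated one.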
Yuck....
Scripts think
For clarity, there are no "normal" computer-involved mergers. The occasional cleanup effort (which will still involve people) will be an Issue. A new value in https://arctos.database.museum/info/ctDocumentation.cfm?table=ctagent_status could serve as an actionable flag, but I'm not sure anyone will review yet another report. |
But doesn't that have the potential to propagate error? Ok, yes, you have verbatim, and maybe a collection manager could figure out, "oh, these aren't the same, let's fix that," while a less familiar user or someone from outside Arctos looking for data about people may think, "huh, looks like this invertebrate paleontologist also collected squirrels, let's add that to Wikidata!" or something. Maybe that's something we don't need to be concerned about? I don't know. |
I think EVERYTHING has that potential, except maybe the "spend a year" option (and I'm not so sure about it!). So, I think the question becomes what the easiest-to-deal-with errors look like, which is generally a choice between
1. There are many variations of an entity - we have thousands (89339 at the moment) of those, that quickly adds up to millions of agents, nobody can find anything, and the hole just keeps getting deeper - or
2. Some agents get overloaded, but it's hard to avoid noticing the same agent collecting grasshoppers in Madagascar and bears in Alaska 100 years apart, so
- maybe some user splits off the grasshopper-collector (eg if they're trying to link up field notebooks or just have some familiarity with the person or something) and things get a bit better, or
- they don't, but at least they don't make anything worse and the next person still has a realistic path to making things better.
I think that approach best serves the public as well. If Some Agent finds Arctos, searches Agent, and gets a giant mess, they'll probably just shake their head and wander off and we'll never hear from them again. If the same search finds all of their stuff and maybe some extras, they might think we're trying and be inclined to tell us what's not theirs. If we must have messes - and we probably must - then we should strive for manageable messes. |
So I'm still unclear on the best way to do this if preferred names are
identical?
|
Dealing with identical preferred names right now (and it's not a new collection technically, just a common name combination). But this is relevant to incoming collections especially!
I think top priority needs to be loading records so cleanup can happen afterwards. Could be by multiple means, including aggressive merging, but oftentimes if it's not obvious right away then it will take time-consuming work to sort out people names, so save that for later. Not sure if notifications would help, but maybe a low-quality report?
As for the identical preferred names, I'm unhappy with the current constraint. I am not a fan of the arbitrary parenthesis solution, but I will resort to that with a simple (MVZ) after the name so you can see right away the responsible institution (or maybe MVZ:Arch?). But this is just my thing and your thing may be the (Xxxx, although not sure that that is); in any case, I see chaos. I don't see why not a unique constraint on something else besides preferred name, or a combo of the preferred name + another field (which would be required to force a redundant preferred name). Or a new field called fullname (then add your parenthesis if needed). It's just that preferred name gets revealed elsewhere and parenthesis makes it messy looking.
To reiterate, I think the priority has to be loading as clean data as possible and not have agents be the barrier to cataloging. |
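The composite-constraint idea above - uniqueness on preferred name plus another field, rather than on preferred name alone - might look roughly like this in-memory sketch. The class, field names, and normalization are invented for illustration; this is not the Arctos schema:

```python
class AgentTable:
    """Toy table enforcing uniqueness on (preferred_name, institution)
    instead of preferred_name alone, so two institutions can each load
    a 'Michael Petersen' without an arbitrary parenthetical suffix."""

    def __init__(self):
        self._keys = set()

    def add(self, preferred_name, institution):
        # Normalize the name so trivial case/whitespace variants still collide.
        key = (preferred_name.strip().lower(), institution)
        if key in self._keys:
            raise ValueError(f"duplicate agent: {preferred_name} ({institution})")
        self._keys.add(key)

table = AgentTable()
table.add("Michael Petersen", "MVZ")
table.add("Michael Petersen", "MSB")  # allowed: same name, different institution
```

In a real database this would be a multicolumn unique constraint; the sketch just shows that redundant preferred names become loadable while exact duplicates within one institution are still rejected.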
I am also not a fan of merging duplicate preferred names that are 100 years
apart and opposite sides of the country. It makes more sense to me to have
the option of keeping preferred names separate unless we have enough info
to merge them, rather than merging them and then trying to figure out how
to parse them back out later.
|
And I would use (MSB) in that case if that were the only alternative.
|
Also agree that the process of prechecking agents and having to try to identify and remove duplicates is hugely time consuming and difficult. Perhaps we could enable the option to "allow all duplicate preferred names but put institution in parentheses after it" for new agent bulkloads, and sort them out later?
|
I don't think there's any argument that what we're doing is correct, but it does have some useful functionality which can't exist in other models. Transitioning would involve a lot more than dropping the key.
I'd much prefer "(1940s gopher collector)" - most agents that historians might care about (those with publications and field notes and such) collected stuff that's ended up in various collections.
That's a core use of verbatim collector - it's possible to delay cleanup, and move it into the context of the rest of the data, without making any messes that anyone else can see.
That certainly seems like it could be part of a valid strategy to me, but even timid approaches have met a lot of resistance from time to time. Whatever we do, it has to be done as a comprehensive plan - we can't make it super-easy to create and difficult to merge, for example. |
For Agent Committee discussion:
|
The Committee really needs to be considering the full equation, and the implications of any choices made or suggested. Arctos cannot become a cesspool of low-quality data and retain the functionality that makes us different from any other CMS (nor, I suspect, the users who rely on that functionality).
That seems slightly enthusiastic, but we should absolutely be making more use of verbatim. A user or collection can do so without giving up anything, and completely avoid entering any sort of general-scope agent quality discussions. "Upgrading" when (or if) desired is relatively easy, and would happen in the context of everything else in Arctos (vs. the traditional "jumble of bare names in a spreadsheet"). I don't think that's quite understood, and I think it drives the tone of some of these conversations.
"Less work sounds nice!" is an easy call - but that MUST be balanced by allowing easy merger. Depending on the details, that could even lead to things like automation and tighter schedules becoming necessary. Requiring a bit more research might allow us to tighten the merger functionality. Access to better tools could potentially shift that balance in some radical direction. |
OK - here is example number 2 of working on agent names. https://github.com/ArctosDB/data-migration/issues/1178 A slightly smaller bunch, but similar results. |
I think that the chances of agent name cleanup post-migration are close to nil. Our model forces people to take the time to review possible duplicate agents before they come into Arctos, and this normalization is a major selling point. If new collections are getting hung up, then I think verbatim collector is the way to go to keep things relatively clean (and hopefully this is just a portion of their agent list, with many names being exact matches to existing agents or entirely novel names to add to the agent table). |
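The three-way sorting described above - exact matches, close matches needing review, and entirely novel names - can be roughed out with stdlib fuzzy matching. The use of `difflib` and the 0.85 cutoff are assumptions for this sketch, not the Agent prebulkloader's actual algorithm:

```python
import difflib

def classify_names(incoming, existing, cutoff=0.85):
    """Split an incoming agent list into exact matches, close matches
    (with their candidates, for human review), and novel names."""
    exact, review, novel = [], [], []
    for name in incoming:
        if name in existing:
            exact.append(name)
        else:
            close = difflib.get_close_matches(name, existing, n=3, cutoff=cutoff)
            if close:
                review.append((name, close))  # e.g. the two Petersens above
            else:
                novel.append(name)
    return exact, review, novel
```

Only the middle bucket needs the time-consuming "personal bias" decisions; exact matches and novel names can load without review, which is where most of the migration time savings would come from.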
@ebraker I think that's exactly where the agents committee meeting ended up.
The first babystep in that direction is #4554 |
OK - I think important stuff from this issue has been propagated to other issues. I'm gonna close this. |
@ArctosDB/agents-committee and anyone else who has an opinion!
I spent the last three days doing this - https://github.com/ArctosDB/data-migration/issues/660#issuecomment-1093313436
The data had already been cleaned and reviewed by collections staff, so it was pretty good to begin with. Still, processing through the Agent prebulkloader is a load-and-wait task, and then reviewing the results can take a while as well. There are a lot of what I would call "personal bias" decisions made during review, and my own personal bias might be different from the first hour to the seventh....
Do I spend time digging into these two names?
Michael K. Petersen (Arctos)
Michael Petersen (Collection)
At the point in time I was working on these names, they were all that I had, so I cannot look at what Michael Petersen collected or when or where they collected. To add to the complexity - the list of agents I was working with crosses all collection types (mammals, birds, herps, etc.) so I don't even have a general idea of what the collection agents are associated with. I can ask the incoming collection to dig in, but they have just spent months cleaning the list that I processed. I would like to discuss how we handle these situations so that data migration progress doesn't get bogged down in people names, BUT we continue to keep our agent list as clean as we can.
I proposed one solution in the linked issue:
Other ideas appreciated!