Agents - Need help with preparing agents for incoming collections #4526

Jegelewicz · 2022-04-08T20:32:12Z

@ArctosDB/agents-committee and anyone else who has an opinion!

I spent the last three days doing this - https://github.com/ArctosDB/data-migration/issues/660#issuecomment-1093313436

The data had already been cleaned and reviewed by collections staff, so it was pretty good to begin with. Still, processing through the Agent prebulkloader is a load-and-wait task, and reviewing the results can take a while as well. There are a lot of what I would call "personal bias" decisions made during review, and my own personal bias might be different from the first hour to the seventh....

Do I spend time digging into these two names?

Michael K. Petersen (Arctos)
Michael Petersen (Collection)

At the point in time I was working on these names, they were all that I had, so I cannot look at what Michael Petersen collected or when or where they collected. To add to the complexity - the list of agents I was working with crosses all collection types (mammals, birds, herps, etc.) so I don't even have a general idea of what the collection agents are associated with. I can ask the incoming collection to dig in, but they have just spent months cleaning the list that I processed. I would like to discuss how we handle these situations so that data migration progress doesn't get bogged down in people names, BUT we continue to keep our agent list as clean as we can.

I proposed one solution in the linked issue:

Add the names and do the research after the data is in Arctos

I find the second appealing as long as we actually do it, because then we can look at the activity of the two close matches in Arctos and determine if they are the same person. If they are, we can merge them; if not, we can mark them "not the same as". Doing the research now means comparing collecting dates/localities/etc. between your data and whatever is in Arctos for the closely matched person, which is doable, but not as easy.

Other ideas appreciated!

campmlc · 2022-04-08T20:53:01Z

Add new name as Michael Petersen Xxxx Collection bulkloaded agent for preferred name?


dustymc · 2022-04-08T20:53:30Z

This is the balance between the ease of creation and the ease of merge I bring up from time to time.

If mergers are to be difficult, you probably really need to spend a week (month, year....) on this, and I doubt anyone wants to subject themselves to any aspect of that. Given that block, I remain a huge fan of marking anything that isn't obviously different to merge, and encouraging everyone with agent access to do the same.

Add the names and do the research after the data is in Arctos

I think that's a little too coarse, but (under the viewpoint I laid out above) I wouldn't spend too much time trying to figure out Some [ initial | NULL ] Person either. (Maybe that's exactly what you said - "research" is kind of a strong word for what I generally do, but it's not just blindly dumping whatever shows up in either.)

Aggressively "merging" in the pre-load stage - strongly preferring a sorta-similar existing Agent - might be a viable approach as well, as long as the verbatim gets included with the records. (Burke took that to extremes and didn't initially load collector-agents at all, they might have a useful viewpoint in this.)

ewommack · 2022-04-08T20:59:15Z

Add new name as Michael Petersen Xxxx Collection bulkloaded agent for preferred name?

I don't know, wouldn't that tell people we give everyone a new name every time they move to a new collection? Also, what is the Xxxx standing in for?

Aggressively "merging" in the pre-load stage - strongly preferring a sorta-similar existing Agent - might be a viable approach as well, as long as the verbatim gets included with the records. (Burke took that to extremes and didn't initially load collector-agents at all, they might have a useful viewpoint in this.)

Does this mean finding as many people that are maybe the same as someone else and calling them the same person before you enter your first batch of names?

Nicole-Ridgwell-NMMNHS · 2022-04-08T22:02:55Z

I'm in favor of loading (cleaned) names and reviewing later.

Perhaps a better way of flagging names could help: a flag that is deliberately placed (not computer generated), doesn't result in agent merger in two weeks, and prompts Arctos to send data quality contacts summary emails of their agents with these flags. So for example, UWZM:Bird loads their agents. Whoever uploads the data flags the agents that came up as "needs review" (or any others they feel they want to review). When their data is loaded they get one monthly summary email: "hey, you flagged this list of agents, go review them."
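A minimal sketch of the flag-and-summarize idea above, for illustration only. The `FlaggedAgent` record, its field names, and the `monthly_summaries` function are all hypothetical and do not correspond to any existing Arctos table or tool; the sketch only shows grouping deliberately placed flags into one message per data quality contact.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FlaggedAgent:
    # All fields are hypothetical; they do not mirror real Arctos columns.
    agent_name: str
    collection: str     # e.g. "UWZM:Bird"
    flagged_by: str     # a person placed this flag, never a script
    contact_email: str  # the collection's data quality contact


def monthly_summaries(flags):
    """Group manually placed flags into one message body per contact."""
    by_contact = defaultdict(list)
    for flag in flags:
        by_contact[flag.contact_email].append(flag)

    summaries = {}
    for email, items in by_contact.items():
        lines = ["You flagged these agents for review:"]
        for item in sorted(items, key=lambda f: f.agent_name):
            lines.append(f"- {item.agent_name} ({item.collection})")
        summaries[email] = "\n".join(lines)
    return summaries
```

Nothing in this sketch merges or changes an agent; it only reports, which matches the request that the flag not trigger a merger on its own.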

dustymc · 2022-04-08T22:09:29Z

Michael Petersen Xxxx Collection bulkloaded agent

Yuck....

Does this mean

Scripts think Some New Person might be Some Person, nothing obvious suggests they're wrong, just accept the suggestion (and add verbatim collector=Some New Person to the relevant records).

not computer generated

For clarity, there are no "normal" computer-involved mergers. The occasional cleanup effort (which will still involve people) will be an Issue.

A new value in https://arctos.database.museum/info/ctDocumentation.cfm?table=ctagent_status could serve as an actionable flag, but I'm not sure anyone will review yet another report.
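The "just accept the suggestion" step described earlier in this comment could be sketched roughly as follows. This is an assumption-laden illustration: the thread does not say how the Arctos scripts score similarity, so Python's standard `difflib.SequenceMatcher` is swapped in as a stand-in, and `best_match`, `resolve_collector`, and the 0.85 threshold are hypothetical.

```python
from difflib import SequenceMatcher


def best_match(incoming, existing_agents, threshold=0.85):
    """Return the most similar existing agent name, or None below threshold.

    The similarity measure and threshold are illustrative assumptions.
    """
    best_name, best_score = None, 0.0
    for name in existing_agents:
        score = SequenceMatcher(None, incoming.lower(), name.lower()).ratio()
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None


def resolve_collector(incoming, existing_agents):
    """Accept a close match but always keep the incoming string as verbatim."""
    match = best_match(incoming, existing_agents)
    return {
        "agent": match,                  # None means no suggestion was made
        "verbatim_collector": incoming,  # preserved so the choice can be revisited
    }
```

Keeping the incoming string on every record is the point of the design: if the accepted match later turns out to be wrong, the original name is still attached to the records that need to be split off.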

Nicole-Ridgwell-NMMNHS · 2022-04-08T22:40:06Z

just accept the suggestion

But doesn't that have the potential to propagate error? OK, yes, you have verbatim, and maybe a collection manager could figure out, "oh these aren't the same, let's fix that", while a less familiar user/someone from outside Arctos looking for data about people may think "huh, looks like this invertebrate paleontologist also collected squirrels, let's add that to wikidata!" or something. Maybe that's something we don't need to be concerned about? I don't know.

dustymc · 2022-04-09T00:23:33Z

potential to propagate error?

I think EVERYTHING has that potential, except maybe the "spend a year" option (and I'm not so sure about it!). So, I think the question becomes what the easiest to deal with errors look like, which is generally a choice between

1. There are many variations of an entity, we have thousands (89339 at the moment) of those, that quickly adds up to millions of agents, nobody can find anything and the hole just keeps getting deeper, or
2. Some agents get overloaded, but it's hard to avoid noticing the same agent collecting grasshoppers in Madagascar and bears in Alaska 100 years apart, so
   - maybe some user splits off the grasshopper-collector (eg if they're trying to link up field notebooks or just have some familiarity with the person or something) and things get a bit better, or
   - they don't, but at least they don't make anything worse and the next person still has a realistic path to making things better

I think that approach best serves the public as well. If Some Agent finds Arctos, searches Agent and gets a giant mess they'll probably just shake their head and wander off and we'll never hear from them again. If the same search finds all of their stuff and maybe some extras, they might think we're trying and be inclined to tell us what's not theirs.

If we must have messes - and we probably must - then we should strive for manageable messes.
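As a rough illustration of why an overloaded agent is "hard to avoid noticing", a check like the one below would surface agents whose collecting years span an implausible range. It is only a sketch: the `(agent_name, year)` input shape, the `overloaded_agents` name, and the 80-year cutoff are hypothetical, and no such report is described in this thread.

```python
def overloaded_agents(records, max_span_years=80):
    """Return agents whose collecting years span more than max_span_years.

    records: iterable of (agent_name, year) pairs; this shape is hypothetical.
    """
    years = {}
    for agent, year in records:
        lo, hi = years.get(agent, (year, year))
        years[agent] = (min(lo, year), max(hi, year))
    # Sort only so that the output order is stable for review.
    return sorted(a for a, (lo, hi) in years.items() if hi - lo > max_span_years)
```

A real review would also look at localities and taxa; the date span is just the cheapest signal to compute.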

campmlc · 2022-04-09T02:01:03Z

maybe some user splits off the grasshopper-collector

So I'm still unclear on the best way to do this if preferred names are identical?


mkoo · 2022-04-09T23:18:17Z

Dealing with identical preferred names right now (and it's not a new collection technically, just a common name combination). But this is relevant to incoming collections especially!

I think top priority needs to be loading records so cleanup can happen afterwards. That could be by multiple means, including aggressive merging, but oftentimes if it's not obvious right away then it will take time-consuming work to sort out people names, so save that for later. Not sure if notifications would help, but maybe a low-quality report?

As for the identical preferred names, I'm unhappy with the current constraint. I am not a fan of the arbitrary parentheses solution but I will resort to that with a simple (MVZ) after the name so you can see right away the responsible institution (or maybe MVZ:Arch?). But this is just my thing and your thing may be the (Xxxx), although I'm not sure what that is; in any case, I see chaos. I don't see why we couldn't have a unique constraint on something else besides preferred name, or on a combo of the preferred name + another field (which would be required to force a redundant preferred name). Or a new field called fullname (then add your parentheses if needed). It's just that preferred name gets revealed elsewhere and parentheses make it messy looking.

To reiterate, I think the priority has to be loading as clean data as possible and not having agents be the barrier to cataloging.
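The "preferred name + another field" idea above can be sketched with a composite unique constraint. This is a toy model only: it uses an in-memory SQLite database as a stand-in, and the `agent` table layout, the `disambiguator` column, and the `add_agent` helper are hypothetical and not the actual Arctos schema.

```python
import sqlite3

# In-memory stand-in database; the table and column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE agent (
        agent_id INTEGER PRIMARY KEY,
        preferred_name TEXT NOT NULL,
        disambiguator TEXT NOT NULL DEFAULT '',
        UNIQUE (preferred_name, disambiguator)
    )
    """
)


def add_agent(name, disambiguator=""):
    """Insert an agent; return False if the (name, disambiguator) pair exists."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO agent (preferred_name, disambiguator) VALUES (?, ?)",
                (name, disambiguator),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

Under this model a second "Michael Petersen" is rejected unless a non-empty disambiguator is supplied, so the preferred name itself stays free of parentheses. Whether anything like this is workable in Arctos is a separate question that the thread leaves open.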

campmlc · 2022-04-09T23:23:30Z

I am also not a fan of merging duplicate preferred names that are 100 years apart and opposite sides of the country. It makes more sense to me to have the option of keeping preferred names separate unless we have enough info to merge them, rather than merging them and then trying to figure out how to parse them back out later.


campmlc · 2022-04-09T23:24:06Z

And I would use (MSB) in that case if that were the only alternative.

campmlc · 2022-04-09T23:28:43Z

Also agree that the process of prechecking agents and having to try to identify and remove duplicates is hugely time consuming and difficult. Perhaps we could enable the option to "allow all duplicate preferred names but put institution in parenthesis after it" for new agent bulkloads, and sort them out later?

dustymc · 2022-04-10T16:06:46Z

identical preferred names

New issue - I don't think there's any argument that what we're doing is correct, but it does have some useful functionality which can't exist in other models. Transitioning would involve a lot more than dropping the key.

(MVZ)

I'd much prefer "(1940s gopher collector)" - most agents that historians might care about (those with publications and field notes and such) collected stuff that's ended up in various collections.

top priority needs to be loading records so clean up can happen afterwards.

That's a core use of verbatim collector - it's possible to delay cleanup, and move it into the context of the rest of the data, without making any messes that anyone else can see.

aggressive merging

That certainly seems like it could be part of a valid strategy to me, but even timid approaches have met a lot of resistance from time to time. Whatever we do, it has to be done as a comprehensive plan - we can't make it super-easy to create and difficult to merge, for example.

Jegelewicz · 2022-04-12T23:02:35Z

For Agent Committee discussion:

Load single name agents?
How much research should incoming collections have to do?
Hot take...should any agent that has nothing but names be moved to "verbatim"?

dustymc · 2022-04-13T14:47:55Z

The Committee really needs to be considering the full equation, and the implications of any choices made or suggested. Arctos cannot become a cesspool of low-quality data and retain the functionality that makes us different than any other CMS (nor, I suspect, our users who rely on that functionality).

should any agent that has nothing but names be moved to "verbatim"

That seems slightly enthusiastic, but we should absolutely be making more use of verbatim. A user or collection can do so without giving up anything, and completely avoid entering any sort of general-scope agent quality discussions. "Upgrading" when (or if) desired is relatively easy, and would happen in the context of everything else in Arctos (vs. the traditional "jumble of bare names in a spreadsheet"). I don't think that's quite understood, and I think it drives the tone of some of these conversations.

How much research

"Less work sounds nice!" is an easy call - but that MUST be balanced by allowing easy merger. Depending on the details, that could even lead to things like automation and tighter schedules becoming necessary. Requiring a bit more research might allow us to tighten the merger functionality. Access to better tools could potentially shift that balance in some radical direction.

Jegelewicz · 2022-04-13T15:47:32Z

OK - here is example number 2 of working on agent names.

https://github.com/ArctosDB/data-migration/issues/1178

A slightly smaller bunch, but similar results.

ebraker · 2022-04-13T21:59:12Z

I think that the chances of agent name cleanup post-migration are close to nil. Our model forces people to take the time to review possible duplicate agents before they come into Arctos, and this normalization is a major selling point. If new collections are getting hung up, then I think verbatim collector is the way to go to keep things relatively clean (and hopefully this is just a portion of their agent list, with many names being exact matches to existing agents or entirely novel names to add to the agent table).

dustymc · 2022-04-13T22:17:54Z

@ebraker I think that's exactly where the agents committee meeting ended up.

do more - maybe much more - with verbatim collector; create a lot fewer agents (and clean up existing)
add some constraints on agents - any agent who can't be entered as a verbatim collector should require SOMETHING - status, addresses, relationships, whatever. I don't think we have anything specific in mind yet, but somehow require agents to be more than strings-with-complications.
build tools - whatever they'll eventually look like - to help "agentify" verbatim collectors when more becomes known.

The first babystep in that direction is #4554

Jegelewicz · 2022-04-14T15:53:34Z

OK - I think important stuff from this issue has been propagated to other issues. I'm gonna close this.

Jegelewicz added the Priority-High (Needed for work), Function-Agents, and Administrative labels Apr 8, 2022

Jegelewicz added this to the Needs Discussion milestone Apr 8, 2022

Jegelewicz mentioned this issue Apr 11, 2022

Use of preferred name as unique key for agent table #4534

Closed

dustymc mentioned this issue Apr 13, 2022

Feature Request - Put in Bionomia links to Agent Profiles #4548

Closed

Jegelewicz closed this as completed Apr 14, 2022
