Agents - Need help with preparing agents for incoming collections #4526

Closed
Jegelewicz opened this issue Apr 8, 2022 · 19 comments
Labels
Administrative How the community functions - these issues may be transferred to internal repos Function-Agents Priority-High (Needed for work) High because this is causing a delay in important collection work..

Comments

@Jegelewicz
Member

Jegelewicz commented Apr 8, 2022

@ArctosDB/agents-committee and anyone else who has an opinion!

I spent the last three days doing this - https://github.com/ArctosDB/data-migration/issues/660#issuecomment-1093313436

The data had already been cleaned and reviewed by collections staff, so it was pretty good to begin with. Still, processing through the Agent prebulkloader is a load-and-wait task, and reviewing the results can take a while as well. There are a lot of what I would call "personal bias" decisions made during review, and my own personal bias might be different from the first hour to the seventh....

Do I spend time digging into these two names?

Michael K. Petersen (Arctos)
Michael Petersen (Collection)

At the point in time I was working on these names, they were all that I had, so I cannot look at what Michael Petersen collected or when or where they collected. To add to the complexity - the list of agents I was working with crosses all collection types (mammals, birds, herps, etc.), so I don't even have a general idea of what collection the agents are associated with. I can ask the incoming collection to dig in, but they have just spent months cleaning the list that I processed. I would like to discuss how we handle these situations so that data migration progress doesn't get bogged down in people names, BUT we continue to keep our agent list as clean as we can.

I proposed one solution in the linked issue:

Add the names and do the research after the data is in Arctos

I find the second appealing as long as we actually do it, because then we can look at the activity of the two close matches in Arctos and determine if they are the same person. If they are, we can merge them; if not, we can mark them "not the same as". Doing the research now means comparing collecting dates/localities/etc. between your data and whatever is in Arctos for the closely matched person, which is doable, but not as easy.

Other ideas appreciated!

@Jegelewicz Jegelewicz added Priority-High (Needed for work) High because this is causing a delay in important collection work.. Function-Agents Administrative How the community functions - these issues may be transferred to internal repos labels Apr 8, 2022
@Jegelewicz Jegelewicz added this to the Needs Discussion milestone Apr 8, 2022
@campmlc

campmlc commented Apr 8, 2022 via email

@dustymc
Contributor

dustymc commented Apr 8, 2022

This is the balance between the ease of creation and the ease of merge I bring up from time to time.

If mergers are to be difficult, you probably really need to spend a week (month, year....) on this, and I doubt anyone wants to subject themselves to any aspect of that. Given that block, I remain a huge fan of marking anything that isn't obviously different to merge, and encouraging everyone with agent access to do the same.

Add the names and do the research after the data is in Arctos

I think that's a little too coarse, but (under the viewpoint I laid out above) I wouldn't spend too much time trying to figure out Some [ initial | NULL ] Person either. (Maybe that's exactly what you said - "research" is kind of a strong word for what I generally do, but it's not just blindly dumping whatever shows up in either.)

Aggressively "merging" in the pre-load stage - strongly preferring a sorta-similar existing Agent - might be a viable approach as well, as long as the verbatim gets included with the records. (Burke took that to extremes and didn't initially load collector-agents at all, they might have a useful viewpoint in this.)

@ewommack

ewommack commented Apr 8, 2022

Add new name as Michael Petersen Xxxx Collection bulkloaded agent for preferred name?

I don't know, wouldn't that tell people we give everyone a new name every time they move to a new collection? Also, what is the Xxxx standing in for?

Aggressively "merging" in the pre-load stage - strongly preferring a sorta-similar existing Agent - might be a viable approach as well, as long as the verbatim gets included with the records. (Burke took that to extremes and didn't initially load collector-agents at all, they might have a useful viewpoint in this.)

Does this mean finding as many people that are maybe the same as someone else and calling them the same person before you enter your first batch of names?

@Nicole-Ridgwell-NMMNHS

I'm in favor of loading (cleaned) names and reviewing later.

Perhaps a better way of flagging names could help: a flag that is deliberately placed (not computer generated) and doesn't result in agent merger in two weeks, with Arctos sending data quality contacts summary emails of their agents that carry these flags. So for example, UWZM:Bird loads their agents. Whoever uploads the data flags the agents that came up as "needs review" (or any others they feel they want to review). Once their data is loaded, they get one monthly summary email: hey, you flagged this list of agents, go review them.
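
A minimal sketch of the flag-and-digest idea described above, to make the proposal concrete. Everything here is hypothetical: `AgentFlag`, `monthly_digest`, the field names, and the example address are illustrative only and do not correspond to any existing Arctos table or feature.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class AgentFlag:
    """A 'needs review' flag placed by a person, not by a script (hypothetical)."""
    agent_name: str
    collection: str          # e.g. "UWZM:Bird"
    contact_email: str       # assumed data quality contact for the collection
    reviewed: bool = False


def monthly_digest(flags):
    """Return one message body per contact, listing only unreviewed flags."""
    pending = defaultdict(list)
    for flag in flags:
        if not flag.reviewed:
            pending[flag.contact_email].append(flag)

    digests = {}
    for email, items in pending.items():
        items = sorted(items, key=lambda f: f.agent_name)
        lines = [f"- {f.agent_name} ({f.collection})" for f in items]
        digests[email] = "You flagged these agents for review:\n" + "\n".join(lines)
    return digests


flags = [
    AgentFlag("Michael Petersen", "UWZM:Bird", "dq@example.org"),
    AgentFlag("M. Petersen", "UWZM:Bird", "dq@example.org", reviewed=True),
]
digests = monthly_digest(flags)
```

Nothing in this sketch merges agents; the flag only feeds the reminder, which is the point of the proposal.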

@dustymc
Contributor

dustymc commented Apr 8, 2022

Michael Petersen Xxxx Collection bulkloaded agent

Yuck....

Does this mean

Scripts think Some New Person might be Some Person, nothing obvious suggests they're wrong, just accept the suggestion (and add verbatim collector=Some New Person to the relevant records).
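
As a rough illustration of "accept the suggestion and keep the verbatim", here is a sketch under stated assumptions. It uses Python's standard `difflib.SequenceMatcher` as a stand-in for whatever the real matching scripts do; the function name `resolve_collector`, the 0.8 threshold, and the output keys are all hypothetical.

```python
from difflib import SequenceMatcher


def resolve_collector(verbatim, existing_agents, threshold=0.8):
    """Pick the most similar existing agent if it is close enough.

    The verbatim string is always returned alongside the choice, so the
    original name travels with the record whether or not a match is accepted.
    The threshold is an arbitrary illustrative value.
    """
    best, best_score = None, 0.0
    for agent in existing_agents:
        score = SequenceMatcher(None, verbatim.lower(), agent.lower()).ratio()
        if score > best_score:
            best, best_score = agent, score

    accepted = best if best_score >= threshold else None
    return {"agent": accepted, "verbatim_collector": verbatim}


existing = ["Michael K. Petersen", "Jane Doe"]
match = resolve_collector("Michael Petersen", existing)
no_match = resolve_collector("Zed Quux", existing)
```

When no existing agent is close enough, `agent` is `None` and only the verbatim collector is kept, which leaves the decision to create a new agent for later.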

not computer generated

For clarity, there are no "normal" computer-involved mergers. The occasional cleanup effort (which will still involve people) will be an Issue.

A new value in https://arctos.database.museum/info/ctDocumentation.cfm?table=ctagent_status could serve as an actionable flag, but I'm not sure anyone will review yet another report.

@Nicole-Ridgwell-NMMNHS

just accept the suggestion

But doesn't that have the potential to propagate error? OK, yes, you have verbatim, and maybe a collection manager could figure out, "oh, these aren't the same, let's fix that", while a less familiar user or someone from outside Arctos looking for data about people may think "huh, looks like this invertebrate paleontologist also collected squirrels, let's add that to wikidata!" or something. Maybe that's something we don't need to be concerned about? I don't know.

@dustymc
Contributor

dustymc commented Apr 9, 2022

potential to propagate error?

I think EVERYTHING has that potential, except maybe the "spend a year" option (and I'm not so sure about it!). So, I think the question becomes what the easiest to deal with errors look like, which is generally a choice between

  1. There are many variations of an entity, we have thousands (89339 at the moment) of those, that quickly adds up to millions of agents, nobody can find anything and the hole just keeps getting deeper, or
  2. Some agents get overloaded, but it's hard to avoid noticing the same agent collecting grasshoppers in Madagascar and bears in Alaska 100 years apart, so
    • maybe some user splits off the grasshopper-collector (eg if they're trying to link up field notebooks or just have some familiarity with the person or something) and things get a bit better, or
    • they don't, but at least they don't make anything worse and the next person still has a realistic path to making things better

I think that approach best serves the public as well. If Some Agent finds Arctos, searches Agent and gets a giant mess they'll probably just shake their head and wander off and we'll never hear from them again. If the same search finds all of their stuff and maybe some extras, they might think we're trying and be inclined to tell us what's not theirs.

If we must have messes - and we probably must - then we should strive for manageable messes.
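
The "hard to avoid noticing" case in option 2 could also be surfaced mechanically. A minimal sketch, assuming collecting years are available per agent; the function name `looks_overloaded` and the 80-year cutoff are hypothetical choices, not an existing Arctos check.

```python
def looks_overloaded(collecting_years, max_span=80):
    """Return True when one agent's collecting years span more than max_span.

    A long span does not prove that two people share one agent record; it only
    marks the agent as worth a human look. Missing years (None) are ignored.
    """
    years = [y for y in collecting_years if y is not None]
    if not years:
        return False
    return max(years) - min(years) > max_span


grasshoppers_and_bears = looks_overloaded([1890, 1893, 1995])
ordinary_career = looks_overloaded([1950, 1958, 1972])
```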

@campmlc

campmlc commented Apr 9, 2022 via email

@mkoo
Member

mkoo commented Apr 9, 2022

Dealing with identical preferred names right now (and it's not a new collection technically, just a common name combination). But this is relevant to incoming collections especially!

I think top priority needs to be loading records so clean up can happen afterwards. That could be by multiple means, including aggressive merging, but oftentimes if it's not obvious right away then it will take time-consuming work to sort out people names, so save that for later. Not sure if notifications would help, but maybe a low-quality report?

As for the identical preferred names, I'm unhappy with the current constraint. I am not a fan of the arbitrary parentheses solution, but I will resort to that with a simple (MVZ) after the name so you can see right away the responsible institution (or maybe MVZ:Arch?). But this is just my thing, and your thing may be the Xxxx (although I'm not sure what that is); in any case, I see chaos. I don't see why we couldn't have a unique constraint on something other than preferred name, or on a combination of the preferred name and another field (which would be required in order to force a redundant preferred name). Or a new field called fullname (then add your parentheses if needed). It's just that preferred name gets revealed elsewhere, and parentheses make it look messy.
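
To show what "preferred name + another field" could mean in practice, here is a sketch using Python's built-in `sqlite3` and an in-memory database. The `agent` table, the `disambiguator` column, and `add_agent` are hypothetical and are not the Arctos schema; the sketch only demonstrates how a composite unique constraint behaves.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE agent (
        agent_id       INTEGER PRIMARY KEY,
        preferred_name TEXT NOT NULL,
        -- Hypothetical second field; NOT NULL with an empty default so that
        -- two rows without a disambiguator still collide (NULLs would not).
        disambiguator  TEXT NOT NULL DEFAULT '',
        UNIQUE (preferred_name, disambiguator)
    )
    """
)


def add_agent(name, disambiguator=""):
    """Insert an agent; return False if the (name, disambiguator) pair exists."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO agent (preferred_name, disambiguator) VALUES (?, ?)",
                (name, disambiguator),
            )
        return True
    except sqlite3.IntegrityError:
        return False


results = [
    add_agent("Michael Petersen"),          # first bare name
    add_agent("Michael Petersen"),          # same bare name again
    add_agent("Michael Petersen", "MVZ"),   # same name, extra field filled in
]
```

Under this design the preferred name stays free of parentheses, and a duplicate name is only allowed when the second field is supplied.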

To reiterate, I think the priority has to be loading as clean data as possible and not having agents be the barrier to cataloging.

@campmlc

campmlc commented Apr 9, 2022 via email

@campmlc

campmlc commented Apr 9, 2022 via email

@campmlc

campmlc commented Apr 9, 2022 via email

@dustymc
Contributor

dustymc commented Apr 10, 2022

identical preferred names

New issue - I don't think there's any argument that what we're doing is correct, but it does have some useful functionality which can't exist in other models. Transitioning would involve a lot more than dropping the key.

(MVZ)

I'd much prefer "(1940s gopher collector)" - most agents that historians might care about (those with publications and field notes and such) collected stuff that's ended up in various collections.

top priority needs to be loading records so clean up can happen afterwards.

That's a core use of verbatim collector - it's possible to delay cleanup, and move it into the context of the rest of the data, without making any messes that anyone else can see.

aggressive merging

That certainly seems like it could be part of a valid strategy to me, but even timid approaches have met a lot of resistance from time to time. Whatever we do, it has to be done as a comprehensive plan - we can't make it super-easy to create and difficult to merge, for example.

@Jegelewicz
Member Author

For Agent Committee discussion:

  1. Load single name agents?
  2. How much research should incoming collections have to do?
  3. Hot take...should any agent that has nothing but names be moved to "verbatim"?

@dustymc
Contributor

dustymc commented Apr 13, 2022

The Committee really needs to be considering the full equation, and the implications of any choices made or suggested. Arctos cannot become a cesspool of low-quality data and retain the functionality that makes us different than any other CMS (nor, I suspect, our users who rely on that functionality).

should any agent that has nothing but names be moved to "verbatim"

That seems slightly enthusiastic, but we should absolutely be making more use of verbatim. A user or collection can do so without giving up anything, and completely avoid entering any sort of general-scope agent quality discussions. "Upgrading" when (or if) desired is relatively easy, and would happen in the context of everything else in Arctos (vs. the traditional "jumble of bare names in a spreadsheet"). I don't think that's quite understood, and I think it drives the tone of some of these conversations.

How much research

"Less work sounds nice!" is an easy call - but that MUST be balanced by allowing easy merger. Depending on the details, that could even lead to things like automation and tighter schedules becoming necessary. Requiring a bit more research might allow us to tighten the merger functionality. Access to better tools could potentially shift that balance in some radical direction.

@Jegelewicz
Member Author

OK - here is example number 2 of working on agent names.

https://github.com/ArctosDB/data-migration/issues/1178

A slightly smaller bunch, but similar results.

@ebraker
Contributor

ebraker commented Apr 13, 2022

I think that the chances of agent name cleanup post-migration are close to nil. Our model forces people to take the time to review possible duplicate agents before they come into Arctos, and this normalization is a major selling point. If new collections are getting hung up, then I think verbatim collector is the way to go to keep things relatively clean (and hopefully this is just a portion of their agent list, with many names being exact matches to existing agents or entirely novel names to add to the agent table).

@dustymc
Contributor

dustymc commented Apr 13, 2022

@ebraker I think that's exactly where the agents committee meeting ended up.

  • do more - maybe much more - with verbatim collector; create a lot fewer agents (and clean up existing)
  • add some constraints on agents - any agent who can't be entered as a verbatim collector should require SOMETHING - status, addresses, relationships, whatever. I don't think we have anything specific in mind yet, but somehow require agents to be more than strings-with-complications.
  • build tools - whatever they'll eventually look like - to help "agentify" verbatim collectors when more becomes known.

The first baby step in that direction is #4554

@Jegelewicz
Member Author

OK - I think important stuff from this issue has been propagated to other issues. I'm gonna close this.

7 participants