stop low-information agents, do more with verbatim agents #4554

Closed
dustymc opened this issue Apr 13, 2022 · 110 comments
Labels
Enhancement I think this would make Arctos even awesomer! Priority-Critical (Arctos is broken) Critical because it is breaking functionality.

Comments

@dustymc
Contributor

dustymc commented Apr 13, 2022

Is your feature request related to a problem? Please describe.

We have a lot of low-data agents, they make everything in agent land more difficult than it needs to be.

Describe what you're trying to accomplish

Better data, less work.

Describe the solution you'd like

  1. Policy: don't make low-information agents, use verbatim collector instead. Require some information (address, relationship, status date) for agents that do more than 'collector' stuff.
  2. Clean up existing agents to follow that policy
  3. Tools
    • report of verbatim collectors by collection (SQL below)
    • report of verbatim collectors from catalog record results
    • tools to do whatever else is missing under this approach

Describe alternatives you've considered

Much work, bad data.

Additional context

First Step: report of low information agents who don't have addresses or relationships and don't extend beyond table collector.
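A rough sketch of what that first-step report could look like, run here against an in-memory SQLite toy database so it is self-contained. Every table and column name below (agent, address, agent_relations, other_usage) and all the sample rows are assumptions for illustration, not the real Arctos schema; other_usage stands in for "anything beyond table collector".

```python
import sqlite3

# Toy schema and made-up rows; none of this is the real Arctos schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE agent (agent_id INTEGER PRIMARY KEY, preferred_agent_name TEXT);
    CREATE TABLE collector (agent_id INTEGER, collection_object_id INTEGER);
    CREATE TABLE address (agent_id INTEGER, address TEXT);
    CREATE TABLE agent_relations (agent_id INTEGER, related_agent_id INTEGER);
    -- Hypothetical stand-in for any use outside table collector.
    CREATE TABLE other_usage (agent_id INTEGER, used_in TEXT);

    INSERT INTO agent VALUES (1, 'T. K.'), (2, 'Spanjian'), (3, 'Jones'), (4, 'Smith');
    INSERT INTO collector VALUES (1, 100), (2, 101), (3, 102), (4, 103);
    INSERT INTO address VALUES (2, 'San Marcos, CA');
    INSERT INTO other_usage VALUES (3, 'accession');
""")

# Agents used as collectors, with no address, no relationship (in either
# direction), and no use anywhere else.
LOW_INFO_SQL = """
    SELECT a.agent_id, a.preferred_agent_name
    FROM agent a
    WHERE EXISTS (SELECT 1 FROM collector c WHERE c.agent_id = a.agent_id)
      AND NOT EXISTS (SELECT 1 FROM address x WHERE x.agent_id = a.agent_id)
      AND NOT EXISTS (SELECT 1 FROM agent_relations r
                      WHERE r.agent_id = a.agent_id
                         OR r.related_agent_id = a.agent_id)
      AND NOT EXISTS (SELECT 1 FROM other_usage u WHERE u.agent_id = a.agent_id)
    ORDER BY a.preferred_agent_name
"""

low_info = conn.execute(LOW_INFO_SQL).fetchall()
print(low_info)
```

With the sample rows, only the two agents that have nothing beyond a collector row are reported; the one with an address and the one used in an accession are skipped.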

Priority

High, the problem gets worse with every new collection.

EDIT: the promised SQL

select attribute_value, count(*) c from 
cataloged_item
inner join collection on cataloged_item.collection_id=collection.collection_id
inner join attributes on cataloged_item.collection_object_id=attributes.collection_object_id and attribute_type='verbatim collector'
where guid_prefix='CHAS:Mamm'
group by attribute_value order by attribute_value

To run this for another collection, replace CHAS:Mamm in where guid_prefix='CHAS:Mamm' with the appropriate value. Values can be found on https://arctos.database.museum/home.cfm.

@dustymc dustymc added Priority-Critical (Arctos is broken) Critical because it is breaking functionality. Enhancement I think this would make Arctos even awesomer! labels Apr 13, 2022
@dustymc dustymc added this to the Active Development milestone Apr 13, 2022
@dustymc
Contributor Author

dustymc commented Apr 13, 2022

First pass: Attached are 1883 agents who have either one-word or initials preferred names, and who are not found outside of table collector.

Proposal:

  1. add these to each relevant catalog record as
    • attribute: verbatim collector
    • value: {preferred_agent_name from the attached CSV}
    • method: {collector_role}
    • remark: {for_attr_remarks from the attached CSV}
  2. Remove them from table collector and delete the agent records
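The two steps of the proposal can be sketched roughly as follows, again on an in-memory SQLite toy database. The column names on attributes (determination_method, attribute_remark), the role values, and the sample rows are assumptions; temp_agent_clean_first stands in for the attached CSV. This is not the script that would actually run against Arctos.

```python
import sqlite3

# Toy schema and made-up rows for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE agent (agent_id INTEGER PRIMARY KEY, preferred_agent_name TEXT);
    CREATE TABLE collector (agent_id INTEGER, collection_object_id INTEGER,
                            collector_role TEXT);
    CREATE TABLE attributes (collection_object_id INTEGER, attribute_type TEXT,
                             attribute_value TEXT, determination_method TEXT,
                             attribute_remark TEXT);
    -- Stand-in for the attached CSV of agents to convert.
    CREATE TABLE temp_agent_clean_first (agent_id INTEGER,
                                         preferred_agent_name TEXT,
                                         for_attr_remarks TEXT);

    INSERT INTO agent VALUES (1, 'T. K.'), (2, 'Qinaqtaq');
    INSERT INTO collector VALUES (1, 100, 'collector'), (1, 101, 'preparator'),
                                 (2, 200, 'creator');
    INSERT INTO temp_agent_clean_first VALUES (1, 'T. K.', 'former agent_id=1');
""")

with conn:  # one transaction: commit on success, roll back on any error
    # Step 1: one 'verbatim collector' attribute per collector row.
    conn.execute("""
        INSERT INTO attributes (collection_object_id, attribute_type,
                                attribute_value, determination_method,
                                attribute_remark)
        SELECT c.collection_object_id, 'verbatim collector',
               t.preferred_agent_name, c.collector_role, t.for_attr_remarks
        FROM collector c
        JOIN temp_agent_clean_first t ON t.agent_id = c.agent_id
    """)
    # Step 2: drop the collector rows, then the agent records themselves.
    conn.execute("""DELETE FROM collector WHERE agent_id IN
                    (SELECT agent_id FROM temp_agent_clean_first)""")
    conn.execute("""DELETE FROM agent WHERE agent_id IN
                    (SELECT agent_id FROM temp_agent_clean_first)""")

migrated = conn.execute("""
    SELECT collection_object_id, attribute_value, determination_method
    FROM attributes ORDER BY collection_object_id
""").fetchall()
remaining_agents = conn.execute(
    "SELECT preferred_agent_name FROM agent").fetchall()
print(migrated, remaining_agents)
```

Doing both steps in a single transaction means a failure partway through leaves neither orphaned collector rows nor half-created attributes.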

temp_agent_clean_first.csv.zip

I'll proceed (using fresh data) whenever the conversation draws down (originally: if there are no objections by 2022-04-27).

@sharpphyl

Please retain 21263988 | Sanbornes

@dustymc
Contributor Author

dustymc commented Apr 14, 2022

Please retain

If we proceed with this, that would be a matter of data. Maybe we'll be able to see through the clutter enough to build better rules at some point, but for now just about anything would escape the filters I'm working with. Address=South Pacific, alive=1972, WHATEVER. We'd like to have a bar, but at least initially it'll be a very low bar!

Some remark suggests they should be involved in an accession - that would stop this, but hopefully only temporarily.

Agent remarks suggest a name that might lead somewhere and the activity suggests one person, why not just use that and put the uncertainty in the remarks? Maybe we also need some sort of Best Practices document (or the existing one cleaned up or added to) - "when given X, we suggest doing Y...."

Unrelated to agents, some other remark makes me suspect this wasn't collected after 1973, and I'm absolutely positive it wasn't collected tomorrow - event dates could be tightened up a LOT (but not as much as they could have been yesterday...).

@Jegelewicz
Member

We need to make a pass through this because this one

21313587 | ᑭᒐᐃ | first name=ᑭᒐᐃ|aka=Kigai; Remark: Ethnology and History verbatim agent; carver

probably needs to be kept as is

@AJLinn

AJLinn commented Apr 14, 2022

OBJECTION!
Please don't delete anything yet ... but I should be able to get my agents clarified by the time you proceed.
That said, as I go through the list (I'm ever so glad I put my collection in the agent remarks field!), most of my single-name individuals fall into one of two categories:

  1. An Indigenous artist (creator) who is known by only one name. Many of these artists made these items prior to converting to Christianity and therefore did not have a "surname" in the way modern people conceive of "proper" name format. (E.g. 21280156 / Qinaqtaq --> an extremely famous Iñupiaq artist who was the creator of baleen baskets in the first decade of the 1900s and is referenced in a number of peer reviewed publications and books). If we proceed with a blanket rule to eliminate these records or move them to verbatim agent we are being prejudiced against cultures who do not follow our same concepts of names, and thus we will cause people to miss discovering objects in our collection. Admittedly, I have been inconsistent with whether I enter their single name as a first name or a last name. I'd be happy to fix those so they are either first or last name consistently.
  2. A manufacturer name that is a single name. Our protocol has been to have the preferred name written as the name physically appears on a manufactured item. (e.g., 21300649 / Spanjian -->a sports uniform manufacturer in the mid-20th century; I just added an aka with the Spanjian Sportswear).

I have argued in the past for both of these types of single named agents to not be deleted or flagged as somehow "less valid" (i.e., moved to verbatim collector) than a record with more than one name. I will fight all night long to defend the single name Indigenous creator record. I will also defend the use of the name that is printed on the label as the preferred name, but will encourage our staff (including myself) to do a better job of finding the full corporate name, if it exists online.

[I'll now get down from my soapbox...]

@Jegelewicz
Member

@AJLinn brings up a few good points

  • single term agents of type organization should be allowed without issue
  • single term agents of type person should be fine IF they include at least one relationship, address, or status (@AJLinn just create a relationship to your organization (associate of) instead of or along with remarks and that should cover it)

@dustymc
Contributor Author

dustymc commented Apr 14, 2022

The format of the agent name isn't in any way the problem, it's just a convenient place to start. This should eventually involve ALL agent names; they're still just strings, even (maybe especially!) if there are 17 "words" involved.

I should be able to get my agents clarified

Please let me know if there's any way I can help - pull data out, put it in, WHATEVER. If this comes down to one-by-one it may never get finished. (But it got started and we're thinking about this stuff and that's something!)

in the first decade of the 1900s

Great, add that (or the publications or whatever) and the agent easily clears this bar.

manufacturer

Ditto. (And bigger picture, it seems we're going to be forced past our unique preferred name restriction at some point, which would be a lot more approachable if we could tell the Nike in Oregon from the Nike from Greece.)

somehow "less valid"

See above, these are just a convenient place to start. I can drop this and grab a couple thousand random or something if the format is a distraction.

also defend the use of the name that is printed on the label as the preferred name

That is embedded in the "forced past unique restriction" mentioned above. Doing that and avoiding the absolute most disrespectful thing we could do - not properly attributing work to the creators - is the core of this; right now, if both Nikes show up and (reasonably) demand we use their name, we just can't. If we somehow allow two Nikes, we can't tell them apart (except maybe by digging through remarks, which isn't realistic) which leads to us attributing god-stuff to the shoe-folks. We need more data to move past our restrictions.

full corporate name

Please note that more names won't stop this (or that's how I hope it plays out, anyway). This is fundamentally a request for some sort of actual data beyond strings/names. The ideal form of that is something which leads to a lot more data - an ORCID/WikiData/LoC/whatever address - but the bar isn't that high (yet?? Probably never...) and a vague address (Canada) or status date (alive in 1905) will (we so hope) meet the foreseeable needs.

I was going to refer to documentation, since much of the requested information exists (just not in such a way that machines, or humans unless they're willing to dig, can find it), but the current documentation is not clear. @Jegelewicz the remarks section of https://handbook.arctosdb.org/best_practices/Agents.html#general-recommendations-for-creating-meaningful-agents should look more like https://github.com/ArctosDB/documentation-wiki/blob/ee9493ba951cb64639eb0e97fb51b5e909871c01/_documentation/agent.markdown - "Use remarks as a last resort" is the critical (and now missing) idea.

From the CSV:

Remark: UAM ethnology & history; sports uniform manufacturer in mid-20th century; moved from Pasadena to San Marcos, CA in 1971.

I copied some of that to appropriate places:

Screen Shot 2022-04-14 at 7 14 53 AM

And now we have TWO non-name-based data points! There might be another 500 Spanjians out there, maybe even making Sportswear, and as long as they're not operating in San Marcos in 1971 they can't confuse anyone!

Now I'm gonna go file an issue about the values I had to use...

@Jegelewicz
Member

the remarks section of https://handbook.arctosdb.org/best_practices/Agents.html#general-recommendations-for-creating-meaningful-agents should look more like https://github.com/ArctosDB/documentation-wiki/blob/ee9493ba951cb64639eb0e97fb51b5e909871c01/_documentation/agent.markdown - "Use remarks as a last resort" is the critical (and now missing) idea.

moved remarks stuff to Don't

Jegelewicz added a commit to ArctosDB/documentation-wiki that referenced this issue Apr 14, 2022
@dustymc
Contributor Author

dustymc commented Apr 14, 2022

21313587 | ᑭᒐᐃ | first name=ᑭᒐᐃ|aka=Kigai; Remark: Ethnology and History verbatim agent; carver
probably needs to be kept as is

ᑭᒐᐃ is acting as a creator, I think it's safe to assume they were at the creation event which carries places and dates. I don't want to get into some tail wagging the dog situation so I'm (extremely) hesitant to just make those assertions, but I could round them up for human review (and help load anything which passes that).

The other viewpoint is that ᑭᒐᐃ is functionally nothing but a string stored in a complicated way at the moment, changing that to a string stored in a less-complicated structure doesn't change any meaning or function that I can identify. At some point hopefully someone will "elevate" some/many/most "simple string agents" to agent objects (because they want to do something that requires the complexity, not "just because" - I hope), and I'm happy to build tools to facilitate, I just need a use case. (I don't think we're missing any functionality now, but I can probably save some clicking.)

Note also that this approach would unavoidably allow what we're really trying to get rid of. If for some reason someone wants to scrounge up data for T. K. (who seems to be no more than a footnote in an obscure publication), then doing so would put them in the "safe pile" along with any other more-than-strings agent. I'm not sure if that's a feature or a bug, but it's probably unavoidable under this viewpoint.

@ebraker
Contributor

ebraker commented Apr 14, 2022

@dusty is it possible to get a csv or SQL for UCM records using values from temp_agent_clean_first.csv.zip? That way I can more easily take a pass at reviewing and adding more agent info when possible.

@dustymc
Contributor Author

dustymc commented Apr 14, 2022

I did this

select string_agg(guid,',') from (
    select concat(guid_prefix,':',cat_num) as guid from cataloged_item
    inner join collection on cataloged_item.collection_id=collection.collection_id
    inner join collector on cataloged_item.collection_object_id=collector.collection_object_id
    inner join temp_agent_clean_first on temp_agent_clean_first.agent_id=collector.agent_id
    where guid_prefix like 'UCM:%'
) x

but the result is a bit awkward to pass around, so see https://arctos.database.museum/archive/ucm_issue_4554 - let me know if you need something else.

@dustymc dustymc changed the title from "stop low-informtion agents, do more with verbatim agents" to "stop low-information agents, do more with verbatim agents" Apr 14, 2022
@AJLinn

AJLinn commented Apr 15, 2022

"Use remarks as a last resort" is the critical (and now missing) idea.

I actually really disagree with this idea, unless we instead add a free text field called biographical profile or biographical summary. This is essential, useful data that helps distinguish one John Smith from another, it shows up in our agent summary, and is critical for understanding the context of our collections.

Compare our agent record for Robert Bloom to that of the UAF Archives (which is a short one also):
Screen Shot 2022-04-14 at 4 22 18 PM

It's easier and more useful than creating a PDF of a biographical profile and attaching it as a media file to the agent record... more clicks and downloads.

We already allowed for markdown formatting for paragraphs of text, so the agent summary page looks better when there's more there.

just create a relationship to your organization (associate of) instead of or along with remarks and that should cover it

I'm not sure this is an appropriate way to "claim" that agent. I'd prefer to add some born/alive/died/dead data, some geographic information in an address field, or additional biographical info if it's able to be located. Sometimes it's an oral history recording or maybe a historical photo in an online digital archive. Would that help fulfill some data points you're looking for @dustymc ?

@dustymc
Contributor Author

dustymc commented Apr 15, 2022

essential, useful data that helps distinguish one John Smith from another

For anyone who reads it: sure. A date buried in there is also completely inaccessible to things like #4551 (and probably most users). The current documentation says "Don’t use remarks when more formal data are possible." which I believe is correct - we do have an appropriate "more formal" field for places (address) and dates (status) so that doesn't belong (or only belong, I don't care what's replicated in remarks to be more readable or etc.) in remarks. We don't have a place for biographical profile so that does belong in remarks. Unless....

add a free text field called biographical profile

New issue, no objection from me (as long as it can be defined in such a way that it's not "remarks when someone felt like using that field").

create a relationship to your organization (associate of)

If they're working for you: Yes, absolutely.

If they tossed a dead rat (or motorcycle or whatever) at you at some point: Nope, over-using relationships will just result in those data not getting cleaned up when we get access to tools (or brains).

born/alive/died/dead data....geographic information in an address field...historical photo i....online digital archive

Any of that will get the agent over the (tentative) current bar. I'd of course like to have all of it and in great detail, but at this point any sort of structured data feels like a great leap forward.

@dustymc
Contributor Author

dustymc commented May 5, 2022

The conversation seems to have drawn down, OK to proceed per #4554 (comment)?

@AJLinn

AJLinn commented May 5, 2022

If by proceed you mean nuking all the one-name agents, I'm still working on my mega-list to add "alive" info and "shipping" address so there are three points of data. Can you give me time to fix them? I can prioritize for the next couple of days.

@dustymc
Contributor Author

dustymc commented May 5, 2022

No hurry, I just don't want to lose whatever momentum we've got going.

Let me know if I can help with anything.

@AJLinn

AJLinn commented May 5, 2022

Looks like I have 50 agents to update, and unfortunately I don't think there are any automated wizard things we can do other than looking at their agent activity report and assessing each one individually. We'll see how long it takes!

@dustymc
Contributor Author

dustymc commented May 5, 2022

See #4568 - we discussed rebuilding the activity page (somewhere...), let us know what would be useful to surface there.

@dustymc
Contributor Author

dustymc commented Sep 13, 2022

start hunting down missing things

I'm not hunting, it popped in to my notifications today, and nothing is missing, it's just in the wrong place. I do think this is something that docs/announcements will fix, at least for the vast majority of users.

don't determiners have to be agents, not verbatim?

Not really, that's where #4871 took us intentionally or otherwise, but I don't think anyone's been that brave yet and you don't have to be the first. "Alive when the paper was published" gets at a great deal of the problems and is a significant improvement over the vast majority of our agents, I don't have any huge problems with that. That said, if they're just some random author with some tenuous-at-best connection to your collection, why bother adding them?

@Jegelewicz
Member

Is there something for me to do here? I added Risa since she is an author on a paper. I did not add her address or anything else. Do we need to do that for just authors on papers?

@cjconroy currently, this person SHOULD be fine as they are. Plans are to move all agents in one of the collector roles that don't have anything in their agent profile but names and remarks to verbatim agent. Agents used in other capacities (transactions, publications, identifications, media creation) OR that also include an Arctos username will be left alone for now.

BUT - it really helps the community if any information known at the time the agent is created is added!
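As a sketch only, the rule described above could be written as a predicate. The Agent class, its field names, and the COLLECTOR_ROLES values are hypothetical, and treating address, relationship, or status as the data that keeps an agent is an assumption taken from elsewhere in this thread rather than a statement of the actual cleanup script.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set

# Assumed role names; the real collector-role vocabulary may differ.
COLLECTOR_ROLES = {"collector", "preparator", "creator"}


@dataclass
class Agent:
    preferred_name: str
    roles_used: Set[str] = field(default_factory=set)  # every capacity the agent appears in
    addresses: List[str] = field(default_factory=list)
    relationships: List[str] = field(default_factory=list)
    statuses: List[str] = field(default_factory=list)
    username: Optional[str] = None
    remarks: str = ""  # deliberately ignored by the rule below


def moves_to_verbatim(agent: Agent) -> bool:
    """True if the agent would be converted to a verbatim agent."""
    if agent.username:
        return False  # agents with an Arctos username are left alone
    if not agent.roles_used & COLLECTOR_ROLES:
        return False  # never used in a collector role
    if agent.roles_used - COLLECTOR_ROLES:
        return False  # also used in transactions, publications, etc.
    has_structured_data = bool(
        agent.addresses or agent.relationships or agent.statuses
    )
    return not has_structured_data  # only names and remarks: move it


# Made-up example records.
tk = Agent("T. K.", roles_used={"collector"}, remarks="initials only")
spanjian = Agent("Spanjian", roles_used={"creator"},
                 addresses=["San Marcos, CA"], statuses=["alive 1971"])
author = Agent("Example Author", roles_used={"publication author"})

print(moves_to_verbatim(tk), moves_to_verbatim(spanjian), moves_to_verbatim(author))
```

Remarks never affect the outcome here, which matches the point that information kept only in remarks does not count as structured data.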

@catherpes

I don't understand all of this, but I've reviewed the list. These are agents associated with MSB Birds who should remain bona fide agents, with some additional comments not in the file:

Joan Morrison
Walter Vargas Campos
Abraham Urbay -- Deceased May 2020.
Dr. Don H. Wolfe
Jose Antonio Otero
Wilfredo Nanez Aizcorbe
Stephen M. Russell -- Author of birds of British Honduras, which was his grad thesis at LSU I believe; spent some (all?) of his career as curator of the UAZ Bird collection.
Thomas M. Haley -- Falconer who occasionally donates birds to us. I have no contact information for him.

@Jegelewicz
Member

@catherpes please add the information before the end of the year. If I can help in any way, let me know.

@dustymc
Contributor Author

dustymc commented Sep 15, 2022

@catherpes those should all have sufficient information now. Most of them had the information, but in remarks where it's not structured/accessible (and most of them had already been updated by the Agents Committee).

@dustymc
Contributor Author

dustymc commented Oct 5, 2022

From @Jegelewicz in some migration issue:

I have been contemplating your "known" agents and I am reluctant to add those that include nothing more than associate of [institution]

The goal is to not create unnecessary Agents: those whose known information can be carried just as well by a verbatim agent. A vague association with the institution for which collector activities happened can be confidently inferred from the fact that the record carrying the verbatim agent is associated with that institution. I fully agree with the assessment, and it should be added to the cleanup. (Phase Two - I think we've got enough to deal with at the moment!)

@jtgiermakowski

So what's the deadline for this cleanup? end of calendar year? Could I get a list of that for MSB:Herps ? I tried the SQL for UCM but no luck... thanks!

@dustymc
Contributor Author

dustymc commented Oct 17, 2022

end of calendar year?

As far as I know.

list

Screen Shot 2022-10-17 at 9 50 46 AM

@barke042

I've been thinking a bit about this issue. One of the things that has the potential to be valuable about agents in the context of Arctos is the idea that having an agent assigned as collector to a given specimen is an assertion that the same person had that role as they did elsewhere in other collections. In the context of our bird data, for instance, we've been going through our agents and comparing them to the Arctos agent list, and identifying those where there is good evidence they are the same person--same time span, same areas of collection, institutions where we know they have worked or to which we know their material was distributed. I am concerned that that (not insignificant) investment of time not be tossed out because the agent we "synonymized" doesn't have anything other than a collector role. It makes sense for collectors to go into verbatim agent by default initially, but where somebody has taken the time to gather evidence in support of the assertion that our "so and so" is the same as their "so and so", isn't it worth preserving that information?

Can somebody please clarify for me what it will take at a minimum for an agent to NOT get bumped into verbatim agent? I'd like to add whatever it takes to the core agents we've spent time on cleaning up to make sure they don't get bumped.

@dustymc
Contributor Author

dustymc commented Oct 21, 2022

taken the time to gather evidence

This is just a request to record that information in a way it can be queried/is useful to the next person. Address, relationship, or status all carry more information than a string can and so will prevent deletion. (And see #5172 - I don't think we're doing that quite correctly, your input is most welcome.)

@barke042

barke042 commented Oct 21, 2022

Thanks for the clarification. In that case, I will see that we include some level of relationship with the Bell Museum (or another institution, if it involves transferred material) for those agents. Where easily available (e.g., obits exist or HR files allow), I will try to put in a status for born/died if possible.

@dustymc
Contributor Author

dustymc commented Oct 21, 2022

relationship with the Bell

Do please keep in mind the ultimate goal of this, which is having sufficient information to do things like drop the (silly, but necessary for usability) unique index on preferred agent name. Who knows where that line really is, but a "Jones" that dropped a dead squirrel off and a Jones who collected as an employee (and so probably has notes and such) are likely on different sides of it.

tl;dr: plz don't make relationships just to preserve otherwise low-information agents

@dustymc
Contributor Author

dustymc commented Jan 17, 2023

515378 attributes created.

38982 agents removed.

Removed agents and agent names:

Archive 4.zip

@dustymc dustymc closed this as completed Jan 17, 2023
@krgomez

krgomez commented Jan 18, 2023

I am a little frustrated that we now have nearly 1000 records with verbatim agents. I had fixed all of the agents with low information that were provided in the spreadsheet that was shared with us all months ago, but there were apparently over 500 more agents with low information that have now become verbatim agents. I don't understand why these were not flagged in order to give me a chance to update the records before this agent removal process. For many of these people, we do have information available to us that could have been added to flesh out the agent profiles. This makes for a lot of extra work to recreate these agents and link them back to their records. I understand that we haven't lost any information and that everything is still functional, but most of these people should be real agents in Arctos.

@catherpes

Can I get Silas Fischer elevated to agent status?
https://arctos.database.museum/guid/MSB:Bird:60602
https://www.sefischer.com/research

@Jegelewicz
Member

@catherpes instructions are here - https://handbook.arctosdb.org/how_to/How-to-Agentify-Verbatim-Agents.html

Let us know if they aren't clear or are missing anything.

@catherpes

The instructions do not get me to the page shown in the instructions when I choose manage collectors. Verbatim collectors are not shown as they are in the tutorial. Can someone who knows this please do this for me? I shouldn't have to undo what Arctos 'fixed' for me.

Also, maybe this should be its own issue, but it would be helpful to be able to search on verbatim agents in the manage agents search so I don't have to search agents, then search in the specimen search window.

@Jegelewicz
Member

search on verbatim agents in the manage agents search

image

@catherpes

thanks

@Jegelewicz
Member

@catherpes I added Silas Fischer as a collector in all the MSB:Bird records where he was listed as verbatim.

@catherpes

catherpes commented Mar 23, 2023 via email

@Jegelewicz
Member

@catherpes if you get another one of these - pass it to me. I'll use it to create a video tutorial (banging myself on the head for not doing it this time...)

@mkoo
Member

mkoo commented Mar 23, 2023

😊 for the video not for the head-banging!
