This repository has been archived by the owner on May 8, 2024. It is now read-only.

Problems with metadata updates from wikidata #345

Closed
MansMeg opened this issue Sep 8, 2023 · 11 comments

Comments

@MansMeg
Collaborator

MansMeg commented Sep 8, 2023

In the discussion of Pull Request #344 we identified four different problems:

  1. Incorrect updates used in the mapping algorithm.
    Using the unit tests, we caught incorrect revisions made by Wikidata users to iort (correct iorts had been changed). However, the mapping algorithm had already been run on the incorrect data. Running some of the tests before the mapping algorithm might be a solution.
  2. New "duplicate" people are being added to Wikidata but are not merged, causing errors in the mapping algorithm.
    New people can be added by Wikidata users at any time. This means that when we do the metadata update there can be new duplicates (the same person appearing as multiple Wikidata entries). A solution is to list these potential duplicates (new names the mapping algorithm will confuse), together with new names/persons the algorithm finds difficult, and check them quickly before we run the algorithm.

A potential solution is to structure the metadata updates more than we currently do, so that potential problems are captured more efficiently (a rough example of such a pre-run check is sketched below).
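
To make that concrete, here is a minimal sketch of a pre-run duplicate check. The file path `corpus/metadata/person.csv` and the column names `wiki_id` and `name` are assumptions for illustration, not taken from the repository; adjust them to the actual metadata layout.

```python
# Sketch only: flag person names that occur under more than one wiki_id,
# which are candidates for unmerged Wikidata duplicates.  The file path and
# column names ("wiki_id", "name") are assumed.
import pandas as pd

people = pd.read_csv("corpus/metadata/person.csv")  # assumed path

counts = people.groupby("name")["wiki_id"].nunique()
suspect_names = counts[counts > 1].index

# Review these rows by hand before running the mapping algorithm.
print(people[people["name"].isin(suspect_names)].sort_values("name"))
```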

@BobBorges
Collaborator

Re 2, they're not new people, but new attributes or whatever that cause more rows to be created in the input/matching/*.csv files.
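
A similar sketch for this case, assuming the matching files have a `wiki_id` column (an assumption, not verified against the repo), would flag wiki_ids that now produce more than one row:

```python
# Sketch only: list wiki_ids that appear on more than one row of the
# input/matching CSVs after a metadata update.
import glob
import pandas as pd

for path in glob.glob("input/matching/*.csv"):
    df = pd.read_csv(path)
    if "wiki_id" not in df.columns:   # column name is an assumption
        continue
    dup = df[df.duplicated("wiki_id", keep=False)]
    if not dup.empty:
        print(path)
        print(dup.sort_values("wiki_id"))
```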

@BobBorges
Collaborator

I will try to outline a procedure for updates from Wikidata before the next time we do it, hopefully to avoid some of the trouble we ran into this time.

@salgo60
Contributor

salgo60 commented Sep 9, 2023

Let me know if WD has errors

Add sources when changing values

I saw some earlier edits done by the project in Wikidata without sources....

  • Let me know if we should have a short session where I show you how to add/copy the source to a statement
    • statements without a source in Wikidata are not preferred as the source should confirm what you find in WD and make Wikidata a little bit more trustworthy....
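
One way to find candidates for this kind of sourcing work is a query against the Wikidata Query Service. The sketch below lists P2561 ("name") statements without any reference, scoped to the two items discussed in this thread; the endpoint usage is standard WDQS, but the scoping is only an illustration.

```python
# Sketch only: find P2561 ("name") statements that lack a reference,
# limited here to items mentioned in this thread.
import requests

QUERY = """
SELECT ?item ?name WHERE {
  VALUES ?item { wd:Q5792849 wd:Q5795740 }
  ?item p:P2561 ?stmt .
  ?stmt ps:P2561 ?name .
  FILTER NOT EXISTS { ?stmt prov:wasDerivedFrom ?ref }
}
"""

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "metadata-source-check example"},
    timeout=60,
)
for row in resp.json()["results"]["bindings"]:
    print(row["item"]["value"], row["name"]["value"])
```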

alias vs. Name

In WD we can only have sources on statements (properties), e.g. the change to Q5792849 (1940037453) on the Name property

[image]

——
My personal opinion is that the alias field should be used when doing Named Entity Recognition and can contain "all kinds" of information, compared to the Name property P2561 where we should have sources confirming the values.

@BobBorges
Collaborator

In this particular case, it was two individuals in question. My previous edits (with source) were further edited. I put the changes back yesterday.

The edits in question have to do with apparent spelling variants of iort. I don't know myself which variant is correct; my edits are in line with the spelling in the bio books. If there are sources for the other spelling, then I guess both variants should be on Wikidata.

@MansMeg
Collaborator Author

MansMeg commented Sep 9, 2023

Yes. But the spelling in the bio books should be the one that is used when the reference is the bio books. Right, @salgo60?

@BobBorges
Collaborator

That's what I was trying to say: my edits have bio book sources and spelling. If alternative spellings are also entered, they should get their own source.

@salgo60
Contributor

salgo60 commented Sep 11, 2023

Yes. But the spelling in the bio books should be the one that is used when the reference is the bio books. Right, @salgo60?

Yes, as mentioned before:

  • The book "Tvåkammar-riksdagen 1867–1970" rather often has more than one article about the same person (> 150 persons), see a small check.

    • Articles that I guess were published at different times --> the book itself may have different "i riksdagen kallad" values for the same person. Today I only reference a person once even if they appear in more articles, which I feel is wrong; see the example below.
      • We also have the book "Enkammarriksdagen 1971–1993/94" as a source. I don't know if anyone has checked whether the two books state the same facts... I guess not.
      • Riksdagen has a field "iort"; it looks like they can only store one value in it, and I feel it is normally bad-quality data, see #141.
    • Example person with more articles: Q5795740 "Hederstierna i Stockholm senare Västerås", in books 1:436 | 2:158 | 4:92. Are they identical? In the example below we can see they don't have the same parties --> we should add all the books and see what facts they confirm. Here is a list of > 150 persons with more articles.
      • From the example below we can see that the naming of parties is not consistent:
        • 1:436 uses "skånska p" / "centern"
        • 2:158 uses just "centern", not "skånska p"
        • 4:92 uses "skånska p" and "AK:s center"
          • It is very important that we get ONE list of the parties we think have existed, with unique persistent identifiers, and a record of which different name strings refer to the same party (a toy mapping is sketched after the page examples below).

Volume 1 page 436 - skånska p / centern

[images]

Volume 2 page 158 - centern

[images]

Volume 4 page 92 - skånska p / AK:s center

[image]
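
A minimal sketch of the "one list of parties" idea: map each party name string found in the books to a single canonical identifier. The identifiers and groupings below are placeholders for illustration; which strings actually denote the same party is exactly the judgement the project would have to record.

```python
# Sketch only: canonical party identifiers are placeholders, and the grouping
# of name strings is illustrative, not a historical claim.
PARTY_ALIASES = {
    "skånska p": "party:0001",
    "centern": "party:0002",
    "ak:s center": "party:0003",
}

def canonical_party(name_string: str) -> str:
    """Return a canonical party identifier for a name string from the books."""
    return PARTY_ALIASES.get(name_string.strip().lower(), "party:unknown")

print(canonical_party("Skånska p"))  # -> party:0001
```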

What would be interesting is if we could confirm what is stated in the books against where it is mentioned in your corpus, and get a better understanding/quality by adding Property:P4584 "first appearance" based on your corpus:

  1. every unique combination of person and iort gets a unique persistent identifier in your corpus (a rough sketch of such an identifier follows the list)
  2. in the Swedish corpus you also track in the TEI code when it is used, and have the persistent identifier in the TEI
  3. "we" in Wikidata can say that the iort name is the same as Riksdagen-corpus xxx
  4. we could start tracking when each unique iort is first and last used, based on your corpus
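
A minimal sketch of item 1, assuming a hash-based identifier scheme (the scheme itself is an illustration, not the project's convention): derive a stable identifier for every unique (person, iort) combination.

```python
# Sketch only: a stable identifier per (wiki_id, iort) pair; the "pi-" prefix
# and the short SHA-1 digest are illustrative choices.
import hashlib

def person_iort_id(wiki_id: str, iort: str) -> str:
    """Derive a persistent identifier for one person-iort combination."""
    digest = hashlib.sha1(f"{wiki_id}|{iort}".encode("utf-8")).hexdigest()[:10]
    return f"pi-{digest}"

print(person_iort_id("Q5795740", "Hederstierna i Stockholm"))
```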

Sources

I would also like to see in your data:

  1. persistent unique identifiers for every source used, e.g. for the books of "Tvåkammar-riksdagen 1867–1970" every volume should have a unique persistent identifier

Examples where "Tvåkammar-riksdagen 1867–1970" is wrong:

[images]

My suggestion: step up and use sources and persistent identifiers.

@BobBorges
Collaborator

I like the idea of persistent identifiers. Until then, I think we can solve (close) this issue with a metadata update procedure:

  1. start a fresh branch off dev

  2. requery metadata
    `scripts/wikidata_query.py` and `scripts/wikidata_process.py`

  3. run test.db.py locally

    • will find changed wiki_id
    • will find edits that conflict with our unit test files (someone edits / deletes iort from wikidata)

    ---> update wiki_ids in unit test files (I will write a script to do this efficiently)
    ---> address edits on wikidata

  4. repeat steps 2 and 3 until test.db.py passes

  5. run redetect.py to remap speakers to intros in protocols

  6. run test.mp.py locally (other tests?)

    • ensure that everything looks like it works (no stray wiki IDs that aren't in the metadata, or whatever)
  7. save diff to (an untracked) file

    • this helped me, when I could search the whole diff to give good answers to those looking at the PR
  8. sample-git-diff on protocols

    • make markdown
  9. git add ONLY sampled protocols
    • commit / push
    • open PR and post the markdown

    ---> unit tests will still fail on remote: that's OK

  10. when sampled diffs look ok

    • add / commit / push the rest of the protocols

    • unit tests should pass on remote --> merge

The issues last time around would have been spotted and fixed very quickly if I had been following this as a guide. (A rough sketch of the requery-and-test loop, steps 2–6, is below.)
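
A rough Python sketch of the requery-and-test loop. The exact paths and test invocations (`test/db.py`, `test/mp.py`, `scripts/redetect.py`, running the tests via pytest) are assumptions based on the names above, not verified against the repo; the manual steps (fixing unit-test files and Wikidata edits between iterations) stay manual.

```python
# Sketch only: orchestrate the requery/test loop; adjust paths to the repo.
import subprocess
import sys

def run(cmd):
    print("+", " ".join(cmd))
    return subprocess.run(cmd).returncode

def requery():
    # Step 2: requery and process Wikidata metadata.
    run([sys.executable, "scripts/wikidata_query.py"])
    run([sys.executable, "scripts/wikidata_process.py"])

requery()

# Steps 3-4: loop until the database tests pass; changed wiki_ids and
# conflicting iort edits are fixed by hand between iterations.
while run([sys.executable, "-m", "pytest", "test/db.py"]) != 0:
    input("Fix unit-test files / Wikidata edits, then press Enter to requery...")
    requery()

# Step 5: remap speakers to intros in the protocols.
run([sys.executable, "scripts/redetect.py"])

# Step 6: run the MP tests locally as a sanity check.
run([sys.executable, "-m", "pytest", "test/mp.py"])
```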

@MansMeg
Collaborator Author

MansMeg commented Sep 20, 2023

That sounds like a good solution. Maybe put this in the repo wiki for now?

@salgo60
Contributor

salgo60 commented Sep 20, 2023

FYI: We have a suspected duplicate in Wikidata that I have asked other people for a second opinion on, but no feedback yet.

[image]

I used Property:P460 "said to be the same as"

[image]

The sv:Wikipedia article is marked

[image]

@BobBorges
Collaborator

Maybe put this in the repo wiki for now?

done.
