This repository has been archived by the owner on May 8, 2024. It is now read-only.

Problems with metadata updates from wikidata #345

Closed
MansMeg opened this issue Sep 8, 2023 · 11 comments

Comments

@MansMeg
Collaborator

MansMeg commented Sep 8, 2023

In the discussion of Pull Request #344 we identified four different problems:

  1. Incorrect updates used in the mapping algorithm.
    Using the unit tests, we caught incorrect revisions made by Wikidata users to iort (correct iorts had been changed). However, the mapping algorithm had already been run on the incorrect data. Running some of the tests before the mapping algorithm might be a solution.
  2. New "duplicate" people are being added to Wikidata but are not merged, causing errors in the mapping algorithm.
    New people can be added by Wikidata users at any time. This means that when we do the metadata update there can be new duplicates (the same person appearing as multiple Wikidata entries). A solution is to list these potential duplicates (new names the mapping algorithm will confuse), together with new names/persons the algorithm finds difficult, and check them quickly before we run the algorithm.

A potential solution is to structure the metadata updates more than we currently do, so that potential problems are captured more efficiently (a rough example of such a pre-run check is sketched below).
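
To make that concrete, here is a minimal sketch of a pre-run duplicate check. The file path `corpus/metadata/person.csv` and the column names `wiki_id` and `name` are assumptions for illustration, not taken from the repository; adjust them to the actual metadata layout.

```python
# Sketch only: flag person names that occur under more than one wiki_id,
# which are candidates for unmerged Wikidata duplicates.  The file path and
# column names ("wiki_id", "name") are assumed.
import pandas as pd

people = pd.read_csv("corpus/metadata/person.csv")  # assumed path

counts = people.groupby("name")["wiki_id"].nunique()
suspect_names = counts[counts > 1].index

# Review these rows by hand before running the mapping algorithm.
print(people[people["name"].isin(suspect_names)].sort_values("name"))
```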

@BobBorges
Collaborator

Re 2, they're not new people, but new attributes or whatever that cause more rows to be created in the input/matching/*.csv files.
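
A similar sketch for this case, assuming the matching files have a `wiki_id` column (an assumption, not verified against the repo), would flag wiki_ids that now produce more than one row:

```python
# Sketch only: list wiki_ids that appear on more than one row of the
# input/matching CSVs after a metadata update.
import glob
import pandas as pd

for path in glob.glob("input/matching/*.csv"):
    df = pd.read_csv(path)
    if "wiki_id" not in df.columns:   # column name is an assumption
        continue
    dup = df[df.duplicated("wiki_id", keep=False)]
    if not dup.empty:
        print(path)
        print(dup.sort_values("wiki_id"))
```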

@BobBorges
Collaborator

I will try to outline a procedure for updates from Wikidata before the next time we do it, hopefully to avoid some of the trouble we ran into this time.

@salgo60
Contributor

salgo60 commented Sep 9, 2023

Let me know if WD has errors

Add sources when changing values

I saw some earlier edits done by the project in Wikidata without sources....

  • Let me know if we should have a short session where I show you how to add/copy the source to a statement
    • statements without a source in Wikidata are not preferred as the source should confirm what you find in WD and make Wikidata a little bit more trustworthy....
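
One way to find candidates for this kind of sourcing work is a query against the Wikidata Query Service. The sketch below lists P2561 ("name") statements without any reference, scoped to the two items discussed in this thread; the endpoint usage is standard WDQS, but the scoping is only an illustration.

```python
# Sketch only: find P2561 ("name") statements that lack a reference,
# limited here to items mentioned in this thread.
import requests

QUERY = """
SELECT ?item ?name WHERE {
  VALUES ?item { wd:Q5792849 wd:Q5795740 }
  ?item p:P2561 ?stmt .
  ?stmt ps:P2561 ?name .
  FILTER NOT EXISTS { ?stmt prov:wasDerivedFrom ?ref }
}
"""

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "metadata-source-check example"},
    timeout=60,
)
for row in resp.json()["results"]["bindings"]:
    print(row["item"]["value"], row["name"]["value"])
```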

alias vs. Name

In WD we can only have sources on statements (properties), e.g. the change to Q5792849 (1940037453) on the Name property

[image]

——
My personal opinion is that the alias field should be used when doing Named Entity Recognition and can contain "all kinds" of information, compared to the Name property P2561 where we should have sources confirming the values.

@BobBorges
Collaborator

In this particular case, it was two individuals in question. My previous edits (with source) were further edited. I put the changes back yesterday.

The edits in question have to do with apparent spelling variants of iort. I don't know myself which variant is correct; my edits are in line with the spelling in the bio books. If there are sources for the other spelling, then I guess both variants should be on Wikidata.

@MansMeg
Collaborator Author

MansMeg commented Sep 9, 2023

Yes. But the spelling in the bio books should be the one that is used when the reference is the bio books. Right, @salgo60?

@BobBorges
Collaborator

That's what I was trying to say: my edits have bio book sources and spelling. If alternative spellings are also entered, they should get their own source.

@salgo60
Contributor

salgo60 commented Sep 11, 2023

Yes. But the spelling in the bio books should be the one that is used when the reference is the bio books. Right, @salgo60?

Yes, as mentioned before:

  • The book "Tvåkammar-riksdagen 1867–1970" rather often has more than one article about the same person (> 150 persons), see a small check.

    • Articles that I guess were published at different times --> the book itself may have different "i riksdagen kallad" values for the same person. Today I only reference a person once even if they appear in more articles, which I feel is wrong; see the example below.
      • We also have the book "Enkammarriksdagen 1971–1993/94" as a source. I don't know if anyone has checked whether the two books state the same facts... I guess not.
      • Riksdagen has a field "iort"; it looks like they can only store one value in it, and I feel it is normally bad-quality data, see #141.
    • Example person with more articles: Q5795740 "Hederstierna i Stockholm senare Västerås", in books 1:436 | 2:158 | 4:92. Are they identical? In the example below we can see they don't have the same parties --> we should add all the books and see what facts they confirm. Here is a list of > 150 persons with more articles.
      • From the example below we can see that the naming of parties is not consistent:
        • 1:436 uses "skånska p" / "centern"
        • 2:158 uses just "centern", not "skånska p"
        • 4:92 uses "skånska p" and "AK:s center"
          • It is very important that we get ONE list of the parties we think have existed, with unique persistent identifiers, and a record of which different name strings refer to the same party (a toy mapping is sketched after the page examples below).

Volume 1 page 436 - skånska p / centern

[images]

Volume 2 page 158 - centern

[images]

Volume 4 page 92 - skånska p / AK:s center

[image]
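
A minimal sketch of the "one list of parties" idea: map each party name string found in the books to a single canonical identifier. The identifiers and groupings below are placeholders for illustration; which strings actually denote the same party is exactly the judgement the project would have to record.

```python
# Sketch only: canonical party identifiers are placeholders, and the grouping
# of name strings is illustrative, not a historical claim.
PARTY_ALIASES = {
    "skånska p": "party:0001",
    "centern": "party:0002",
    "ak:s center": "party:0003",
}

def canonical_party(name_string: str) -> str:
    """Return a canonical party identifier for a name string from the books."""
    return PARTY_ALIASES.get(name_string.strip().lower(), "party:unknown")

print(canonical_party("Skånska p"))  # -> party:0001
```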

What would be interesting is if we could confirm what is stated in the books against where it is mentioned in your corpus, and get a better understanding/quality by adding Property:P4584 "first appearance" based on your corpus:

  1. every unique combination of person and iort gets a unique persistent identifier in your corpus (a rough sketch of such an identifier follows the list)
  2. in the Swedish corpus you also track in the TEI code when it is used, and have the persistent identifier in the TEI
  3. "we" in Wikidata can say that the iort name is the same as Riksdagen-corpus xxx
  4. we could start tracking when each unique iort is first and last used, based on your corpus
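
A minimal sketch of item 1, assuming a hash-based identifier scheme (the scheme itself is an illustration, not the project's convention): derive a stable identifier for every unique (person, iort) combination.

```python
# Sketch only: a stable identifier per (wiki_id, iort) pair; the "pi-" prefix
# and the short SHA-1 digest are illustrative choices.
import hashlib

def person_iort_id(wiki_id: str, iort: str) -> str:
    """Derive a persistent identifier for one person-iort combination."""
    digest = hashlib.sha1(f"{wiki_id}|{iort}".encode("utf-8")).hexdigest()[:10]
    return f"pi-{digest}"

print(person_iort_id("Q5795740", "Hederstierna i Stockholm"))
```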

Sources

I would also like to see in your data:

  1. persistent unique identifiers for every source used, e.g. for the books of "Tvåkammar-riksdagen 1867–1970" every volume should have a unique persistent identifier

Examples where "Tvåkammar-riksdagen 1867–1970" is wrong:

[images]

My suggestion: step up and use sources and persistent identifiers.

@BobBorges
Collaborator

I like the idea of persistent identifiers. Until then, I think we can solve (close) this issue with a metadata update procedure:

  1. start a fresh branch off dev

  2. requery metadata
    `scripts/wikidata_query.py` and `scripts/wikidata_process.py`

  3. run test.db.py locally

    • will find changed wiki_id
    • will find edits that conflict with our unit test files (someone edits / deletes iort from wikidata)

    ---> update wiki_ids in unit test files (I will write a script to do this efficiently)
    ---> address edits on wikidata

  4. repeat steps 2 and 3 until test.db.py passes

  5. run redetect.py to remap speakers to intros in protocols

  6. run test.mp.py locally (other tests?)

    • ensure that everything looks like it works (no stray wiki IDs that aren't in the metadata, or whatever)
  7. save diff to (an untracked) file

    • this helped me, when I could search the whole diff to give good answers to those looking at the PR
  8. sample-git-diff on protocols

    • make markdown
  9. git add ONLY sampled protocols
    • commit / push
    • open PR and post the markdown

    ---> unit tests will still fail on remote: that's OK

  10. when sampled diffs look ok

    • add / commit / push the rest of the protocols

    • unit tests should pass on remote --> merge

The issues last time around would have been spotted and fixed very quickly if I had been following this as a guide. (A rough sketch of the requery-and-test loop, steps 2–6, is below.)
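
A rough Python sketch of the requery-and-test loop. The exact paths and test invocations (`test/db.py`, `test/mp.py`, `scripts/redetect.py`, running the tests via pytest) are assumptions based on the names above, not verified against the repo; the manual steps (fixing unit-test files and Wikidata edits between iterations) stay manual.

```python
# Sketch only: orchestrate the requery/test loop; adjust paths to the repo.
import subprocess
import sys

def run(cmd):
    print("+", " ".join(cmd))
    return subprocess.run(cmd).returncode

def requery():
    # Step 2: requery and process Wikidata metadata.
    run([sys.executable, "scripts/wikidata_query.py"])
    run([sys.executable, "scripts/wikidata_process.py"])

requery()

# Steps 3-4: loop until the database tests pass; changed wiki_ids and
# conflicting iort edits are fixed by hand between iterations.
while run([sys.executable, "-m", "pytest", "test/db.py"]) != 0:
    input("Fix unit-test files / Wikidata edits, then press Enter to requery...")
    requery()

# Step 5: remap speakers to intros in the protocols.
run([sys.executable, "scripts/redetect.py"])

# Step 6: run the MP tests locally as a sanity check.
run([sys.executable, "-m", "pytest", "test/mp.py"])
```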

@MansMeg
Collaborator Author

MansMeg commented Sep 20, 2023

That sounds like a good solution. Maybe put this in the repo wiki for now?

@salgo60
Contributor

salgo60 commented Sep 20, 2023

FYI: We have a suspected duplicate in Wikidata that I have asked other people for a second opinion on, but no feedback yet.

[image]

I used Property:P460 "said to be the same as"

[image]

The sv:Wikipedia article is marked

[image]

@BobBorges
Collaborator

Maybe put this in the repo wiki for now?

done.
