Python script to parse the wordlist of Finland Swedish dialects published by the Institute for the Languages of Finland and prepare the data for Wikidata.
In Swedish: a Python script to process Swedish words from "Ordbok över Finlands svenska folkmål", published by "Institutet för de inhemska språken", for Wikidata.
The publication can be found at https://kaino.kotus.fi/fo/ and is described on Wikipedia at https://sv.wikipedia.org/wiki/Ordbok_över_Finlands_svenska_folkmål
Collects base data for 79 000 words in a dataframe by parsing the XML file of the publication. The XML files are available for download at https://www.kotus.fi/aineistot/tietoa_aineistoista/sahkoiset_aineistot_kootusti. Parses the metadata in the XML into the dataframe (Regions, Dialects, Grammar, Gloss, Examples, See Also).
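A minimal sketch of this parsing step, assuming one XML element per dictionary entry; the tag names ("entry", "headword", "grammar", "gloss", "region") are placeholders and must be adapted to the actual schema:

```python
import xml.etree.ElementTree as ET
import pandas as pd

def parse_entries(xml_path: str) -> pd.DataFrame:
    """Parse the dictionary XML into one dataframe row per entry (hypothetical tag names)."""
    tree = ET.parse(xml_path)
    rows = []
    for entry in tree.getroot().iter("entry"):           # assumed element name
        rows.append({
            "lemma": entry.findtext("headword"),          # assumed child elements
            "grammar": entry.findtext("grammar"),
            "gloss": entry.findtext("gloss"),
            "regions": [r.text for r in entry.findall("region")],
        })
    return pd.DataFrame(rows)

# df = parse_entries("fo.xml")
```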
Searches Wikidata for lexemes with the same lemma, language (Swedish) and lexical category (noun, verb, etc.) and adds the Wikidata L-code to the dataframe. Saves results to and reads them from cache.json, so the process can resume if interrupted and reuse results on later runs.
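A sketch of the lexeme lookup with a simple JSON cache, using the Wikidata Query Service (Q9027 is Swedish; the category Q-id, e.g. Q1084 for noun, must match the word's part of speech). The cache layout and function name are assumptions, not the script's actual API:

```python
import json
import os

import requests

CACHE_FILE = "cache.json"

def load_cache() -> dict:
    """Read previous lookup results if the cache file exists."""
    if os.path.exists(CACHE_FILE):
        with open(CACHE_FILE, encoding="utf-8") as f:
            return json.load(f)
    return {}

cache = load_cache()

def find_lexeme(lemma: str, category_qid: str):
    """Return the L-code of a Swedish lexeme with this lemma and lexical category, or None."""
    key = f"{lemma}|{category_qid}"
    if key in cache:
        return cache[key]
    query = f'''SELECT ?l WHERE {{
      ?l a ontolex:LexicalEntry ;
         dct:language wd:Q9027 ;                     # Swedish
         wikibase:lexicalCategory wd:{category_qid} ;
         wikibase:lemma "{lemma}"@sv .
    }} LIMIT 1'''
    r = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": query, "format": "json"},
        headers={"User-Agent": "fo-wordlist-script/0.1"},
    )
    bindings = r.json()["results"]["bindings"]
    lcode = bindings[0]["l"]["value"].rsplit("/", 1)[-1] if bindings else None
    cache[key] = lcode
    with open(CACHE_FILE, "w", encoding="utf-8") as f:
        json.dump(cache, f, ensure_ascii=False)
    return lcode
```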
At this point the dataframe can be saved as a pickle file, and later reloaded from that file to skip the previous steps for fast processing.
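For example (the file name is arbitrary and the dataframe below only stands in for the real one):

```python
import pandas as pd

df = pd.DataFrame({"lemma": ["exempel"], "lexeme": ["L1"]})  # stand-in for the parsed dataframe

df.to_pickle("fo_words.pkl")            # save after the slow parsing/lookup steps
df = pd.read_pickle("fo_words.pkl")     # reload later and continue from here
```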
Creates Wikidata QuickStatements commands to add the identifier P12032 (Ordbok över Finlands svenska folkmål ID) to the corresponding existing Wikidata lexemes.
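A sketch of generating tab-separated QuickStatements lines (lexeme L-id, property, quoted identifier value); the dataframe column names are assumptions:

```python
import pandas as pd

def quickstatements(df: pd.DataFrame) -> str:
    """Build one QuickStatements line per matched lexeme: Lxxx <TAB> P12032 <TAB> "id"."""
    lines = []
    for _, row in df.iterrows():
        if row["lexeme"] and row["fo_id"]:             # assumed column names
            lines.append(f'{row["lexeme"]}\tP12032\t"{row["fo_id"]}"')
    return "\n".join(lines)

# print(quickstatements(df))   # paste the output into QuickStatements
```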
Converts all dialect words written in fin notation (over 133 000) to IPA (the International Phonetic Alphabet) based on a conversion table, since Wikidata lexemes use IPA.
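A sketch of the table-driven conversion; the mappings shown are illustrative placeholders, not entries from the actual table:

```python
# Illustrative fragment of a fin-notation -> IPA table; the real table is built
# from the character analysis described further down.
FIN_TO_IPA = {
    "tj": "ɕ",    # placeholder mappings, not the real values
    "sj": "ʃ",
    "å": "oː",
}

def fin_to_ipa(word: str) -> str:
    """Convert a fin-notation dialect word to IPA, matching the longest table key first."""
    keys = sorted(FIN_TO_IPA, key=len, reverse=True)
    out, i = [], 0
    while i < len(word):
        for k in keys:
            if word.startswith(k, i):
                out.append(FIN_TO_IPA[k])
                i += len(k)
                break
        else:
            out.append(word[i])   # characters not in the table pass through unchanged
            i += 1
    return "".join(out)
```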
Filters out dialect words that are not in "lemma form" according to 6 rules created from the word list. The challenge has been to mechanically filter out dialect words that are not in the right "lemma form", since the original word list was written to be read by humans, not machines.
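The actual 6 rules are specific to this word list and are not reproduced here; the sketch below only shows the filtering structure, with two made-up example rules:

```python
import re

# Hypothetical example rules -- the real 6 rules are derived from the word list itself.
RULES = [
    lambda w: not w.endswith("-"),          # e.g. reject truncated forms
    lambda w: not re.search(r"[()]", w),    # e.g. reject words with editorial parentheses
]

def is_lemma_form(word: str) -> bool:
    """A dialect word is kept only if it passes every rule."""
    return all(rule(word) for rule in RULES)

# df = df[df["dialect_word"].apply(is_lemma_form)]   # assumed column name
```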
The above was made possible by analyzing the characters used in the dialect words to create an IPA conversion table for dialect words written in fin (not grov) notation.
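For example, the character inventory can be collected with a Counter over all fin-notation words (the column name is an assumption):

```python
from collections import Counter

def character_inventory(words) -> Counter:
    """Count every character occurring in the fin-notation dialect words."""
    counts = Counter()
    for w in words:
        counts.update(w)
    return counts

# inventory = character_inventory(df["dialect_word_fin"])   # assumed column name
# print(inventory.most_common(50))
```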