Make contributing easier - add information about the project(s) to the final dictionary files #408

Open
C0rn3j opened this issue Dec 7, 2024 · 1 comment


C0rn3j commented Dec 7, 2024

Let me describe my journey to get here, to give an idea of what needs to improve and why.

Initially, I wanted to get Telegram Desktop to understand words like expectedly (it already has unexpectedly), or autocorrect.

There is zero information in the Telegram GUI, which honestly should at least have an attribution nearby:

[three images]

Telegram has the .dic and .aff files, and .aff files have comments, so information can at least be added there.

Then, through a combination of this lovingly-stablebot-locked issue telegramdesktop/tdesktop#7960
and someone linking me to the relevant doc portion here, I learned that I am looking for Hunspell.

Hunspell seems to have had PRs open for many years, such as hunspell/hunspell#612 from 2018, which seems fairly simple, so the project does not seem well maintained.

https://github.com/hunspell/hunspell?tab=readme-ov-file#dictionaries This made me believe that LibreOffice is somehow the upstream source for the dictionaries.

I checked, and the Arch Linux repos have https://archlinux.org/packages/extra/any/hunspell-en_us/ at version 2020.12.07.

Checking out the LO repo, the dictionary has a 2021 commit - https://cgit.freedesktop.org/libreoffice/dictionaries/commit/en/en_US.dic?id=4fa94195b8136364dd40bf2b0366a0fe32058899

Later I found out that this is probably just LO using the upstream source here in this repo somehow, but I wasted time figuring out whether the site Arch uses is actually up to date.

There are a lot of mentions of SCOWL and its sizes across the various docs:
https://cgit.freedesktop.org/libreoffice/dictionaries/tree/en/README_en_US.txt

I have also tried to look up the words I had problems with + words from the 2021 PR here as suggested, which led me to:

A) believe they should indeed be there, as they have more "Should Include" stars than words that are already included:
image

B) be very confused, as "larger (size 80) SCOWL size [1]" is explained as:
"[1] The word was not in any of the speller dictionaries but was found in an larger SCOWL size. The smaller dictionaries included words up to size 60, and the larger dictionary include words up to size 70."
That is not at all helpful for figuring out what size my dictionary is, or where to look for the bigger one.
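
One way to read the footnote quoted above is as a simple cutoff rule. The sketch below assumes that a dictionary built "up to size N" contains every word whose SCOWL size is N or smaller; the names `DICTIONARY_MAX_SIZE` and `dictionaries_containing` are made up for illustration and are not part of SCOWL.

```python
# Cutoffs taken from the footnote: "smaller" dictionaries go up to size 60,
# the "larger" dictionary goes up to size 70.
DICTIONARY_MAX_SIZE = {"smaller": 60, "larger": 70}


def dictionaries_containing(word_scowl_size):
    """Return the dictionaries that would include a word of the given SCOWL size.

    Assumption: a dictionary with cutoff N includes all words of size <= N.
    """
    return [
        name
        for name, max_size in DICTIONARY_MAX_SIZE.items()
        if word_scowl_size <= max_size
    ]


print(dictionaries_containing(60))  # word small enough for both dictionaries
print(dictionaries_containing(80))  # word first appearing at size 80
```

Under that reading, a word that first appears at size 80 falls outside both cutoffs, which would explain why it is reported as missing from every speller dictionary.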


TL;DR

  • Make sure it is EASY for people to contribute.
  • Add comments to the .aff files with information.
  • Explain better what SCOWL and its sizes are on the aspell lookup page
  • What SHOULD projects be attributing when using the Hunspell(?) dictionaries? I am sure TG upstream would be fine with adding some standardized linkthrough, but what should I tell them?

Information that should be added at minimum in my opinion is:

  • Versioning info, to know what dictionary version one is looking at
  • Dictionary size info
  • Link to (or hardcode as text) a quick FAQ on how to add a missing word - info on the word lookup tool should be there, for example.
  • Link to the repository where to contribute (this one(?))
  • Whatever else is necessary to NOT make people go through the 9 circles of Hell that I went through as described above
  • Add info about the process from getting it included upstream here to getting it included downstream

If I am barking up the wrong tree in some cases, please direct me appropriately, I am dazed and confused.

@C0rn3j C0rn3j changed the title Add information about the project(s) to the final dictionary files Make contributing easier - add information about the project(s) to the final dictionary files Dec 7, 2024
@kevina
Member

kevina commented Dec 7, 2024

A lot of these seem like downstream issues; in particular, Telegram could better document where the dictionary comes from. There is also a lot of outdated information out there that needs to be updated.

The way to suggest words is to just open an issue like you did. Version information is included in the README with the official dictionary. I am open to adding it to the actual affix file also. At the very end of the README there is information on how the dictionary is created, for example:

Build Date: Mon Dec  7 20:19:27 EST 2020
Wordlist Command: mk-list --accents=strip en_US 60

However, I agree this information is easy to miss and also rather cryptic to the average user. I am therefore open to adding more human-readable information to both the README and the affix file on how the dictionary is created and, in particular, what SCOWL size is used.
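
Until that happens, the two README lines shown above can be read mechanically. The following is a rough sketch, not an official tool: `parse_build_info` is a made-up helper, and it assumes the last argument of the `mk-list` command is the SCOWL size, which matches the size 60 discussed in this thread but should be checked against the SCOWL docs.

```python
def parse_build_info(readme_text):
    """Extract the build date, wordlist command, and SCOWL size from README text."""
    info = {}
    for line in readme_text.splitlines():
        # Split on the first colon only, so the time in the date stays intact.
        key, sep, value = line.partition(":")
        if sep and key.strip() in ("Build Date", "Wordlist Command"):
            info[key.strip()] = value.strip()

    # Assumption: the final argument of the mk-list command is the SCOWL size.
    parts = info.get("Wordlist Command", "").split()
    if parts and parts[-1].isdigit():
        info["SCOWL Size"] = int(parts[-1])
    return info


readme_tail = (
    "Build Date: Mon Dec  7 20:19:27 EST 2020\n"
    "Wordlist Command: mk-list --accents=strip en_US 60\n"
)
build_info = parse_build_info(readme_tail)
print(build_info)
```

A downstream packager could run something like this at build time to surface the size next to the package version.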
