[Feature] First class Lingua Libre support #263

Closed
shenlebantongying opened this issue Dec 21, 2022 · 4 comments · Fixed by #268

@shenlebantongying
Collaborator

shenlebantongying commented Dec 21, 2022

Forvo is privatizing volunteer contributions, and it no longer offers a free API.

Lingua Libre (@lingua-libre) is a better alternative:

https://lingualibre.org

The project operates under Wikimedia France.

There are already 28k English pronunciations, so I think the project is mature enough: https://commons.wikimedia.org/wiki/Category:Lingua_Libre_pronunciation-eng

Their data is stored on Wikimedia Commons (https://commons.wikimedia.org), so it will probably stay available forever.


To get pronunciations, we just query the Wikimedia Commons API.

Sample query for "nice" in English:

It is just a regex search, because files uploaded through Lingua Libre follow a fixed naming format: LL-<language code>-<author>-<word>.wav

curl "https://commons.wikimedia.org/w/api.php?action=query&format=json&prop=imageinfo&generator=search&iiprop=url&iimetadataversion=1&iiextmetadatafilter=Categories&gsrsearch=intitle%3A%2FLL-Q1860%20%5C(eng%5C)-.*-nice%5C.wav%2F&gsrnamespace=6&gsrlimit=10&gsrwhat=text"
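For reference, the same request URL could be assembled programmatically along these lines. This is only a sketch: `build_query_url` and its parameter names are hypothetical, the split of the language code into a Q-id and an ISO code simply mirrors the `Q1860 (eng)` example above, and it assumes the output of Python's `re.escape` is acceptable to the search backend's regex dialect, which has not been verified.

```python
import re
from urllib.parse import quote, urlencode

API_ENDPOINT = "https://commons.wikimedia.org/w/api.php"


def build_query_url(lang_qid: str, iso_code: str, word: str, limit: int = 10) -> str:
    """Build a Commons search URL for Lingua Libre recordings of `word`.

    Hypothetical helper: the parameter set is copied from the curl example,
    minus the metadata options that are not needed to get the file URL.
    """
    if not lang_qid or not iso_code:
        raise ValueError("a language code is required")
    if not word:
        raise ValueError("a word is required")

    # LL-<language code>-<author>-<word>.wav, with the author left as a wildcard.
    # The word is escaped so regex metacharacters in it are matched literally.
    pattern = rf"LL-{lang_qid} \({iso_code}\)-.*-{re.escape(word)}\.wav"

    params = {
        "action": "query",
        "format": "json",
        "prop": "imageinfo",
        "generator": "search",
        "iiprop": "url",
        "gsrsearch": f"intitle:/{pattern}/",
        "gsrnamespace": 6,  # namespace 6 is "File:", as in the curl example
        "gsrlimit": limit,
        "gsrwhat": "text",
    }
    # quote_via=quote encodes spaces as %20 rather than "+", like the curl URL.
    return f"{API_ENDPOINT}?{urlencode(params, quote_via=quote)}"


if __name__ == "__main__":
    print(build_query_url("Q1860", "eng", "nice"))
```

Rejecting an empty language code up front matches the point made further down that the query should not be sent without one.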

Then just grab the url field from the returned JSON:

{
  "batchcomplete": "",
  "query": {
    "pages": {
      "88511149": {
        "pageid": 88511149,
        "ns": 6,
        "title": "File:LL-Q1860 (eng)-Back ache-nice.wav",
        "index": 2,
        "imagerepository": "local",
        "imageinfo": [
          {
            "url": "https://upload.wikimedia.org/wikipedia/commons/6/6a/LL-Q1860_%28eng%29-Back_ache-nice.wav",
            "descriptionurl": "https://commons.wikimedia.org/wiki/File:LL-Q1860_(eng)-Back_ache-nice.wav",
            "descriptionshorturl": "https://commons.wikimedia.org/w/index.php?curid=88511149"
          }
        ]
      },
      "73937351": {
        "pageid": 73937351,
        "ns": 6,
        "title": "File:LL-Q1860 (eng)-Nattes \u00e0 chat-nice.wav",
        "index": 1,
        "imagerepository": "local",
        "imageinfo": [
          {
            "url": "https://upload.wikimedia.org/wikipedia/commons/b/b0/LL-Q1860_%28eng%29-Nattes_%C3%A0_chat-nice.wav",
            "descriptionurl": "https://commons.wikimedia.org/wiki/File:LL-Q1860_(eng)-Nattes_%C3%A0_chat-nice.wav",
            "descriptionshorturl": "https://commons.wikimedia.org/w/index.php?curid=73937351"
          }
        ]
      }
    }
  }
}
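Pulling the URLs out of a response shaped like the one above could look roughly like this. It is a sketch, not the planned implementation: `extract_audio_urls` is a hypothetical name, and the code assumes that a response with no hits simply lacks the `query` key, and that `index` reflects the search ranking.

```python
import json


def extract_audio_urls(response_text: str) -> list[str]:
    """Return the audio file URLs from a Commons API response, in result order."""
    data = json.loads(response_text)

    # Assumption: when nothing matches, "query" is absent, so fall back to {}.
    pages = data.get("query", {}).get("pages", {})

    # "pages" is keyed by page id, so sort by "index" to restore the ranking.
    ranked = sorted(pages.values(), key=lambda page: page.get("index", 0))

    urls = []
    for page in ranked:
        for info in page.get("imageinfo", []):
            if "url" in info:
                urls.append(info["url"])
    return urls
```

For the sample response, this would yield the "Nattes à chat" recording first (index 1), followed by the "Back ache" one (index 2).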

The docs list the parameter as srsearch, but here it only works with the g prefix (gsrsearch), presumably because search is being used as a generator: https://www.mediawiki.org/wiki/API:Search
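On the prefix: the MediaWiki Action API expects the parameters of a module to carry an extra g prefix when that module is used via generator= instead of list=. The mapping can be illustrated with a small hypothetical helper (`as_generator_params` is not part of any library):

```python
def as_generator_params(module: str, list_params: dict) -> dict:
    """Rewrite parameters written for `list=<module>` into `generator=<module>` form.

    Illustrative only: it just prefixes every module parameter name with "g".
    """
    params = {"generator": module}
    for name, value in list_params.items():
        params["g" + name] = value
    return params


# Parameters as documented on API:Search for list=search ...
list_form = {"srsearch": "intitle:/LL-Q1860 \\(eng\\)-.*-nice\\.wav/", "srnamespace": 6}

# ... and the same parameters as used in the query above.
generator_form = as_generator_params("search", list_form)
```

With generator=search, the search results become the page set that prop=imageinfo then operates on, which is why a single request can return the file URLs.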


To get the supported language IDs, run this query on https://commons-query.wikimedia.org/:

https://lingualibre.org/wiki/Help:SPARQL#Is_Language_.28d:Q34770.29_.E2.86.92_List_existing_languages_with:_LL_Qid.2C_ISO_639-3.2C_Name


Without a personal token, the rate limit is 500 requests per hour, which should be enough for most people.

https://api.wikimedia.org/wiki/Documentation/Getting_started/Rate_limits
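If we wanted to stay under that limit on the client side, a sliding-window counter would probably be enough. The sketch below is one possible approach, not something the API requires; `HourlyBudget` is a hypothetical name, and the 500-per-hour default is simply the figure quoted above.

```python
import time
from collections import deque


class HourlyBudget:
    """Sliding-window request counter (hypothetical client-side guard)."""

    def __init__(self, max_requests: int = 500, window_seconds: float = 3600.0,
                 clock=time.monotonic):
        self._max = max_requests
        self._window = window_seconds
        self._clock = clock  # injectable so the limiter can be tested without sleeping
        self._stamps = deque()  # timestamps of requests still inside the window

    def try_acquire(self) -> bool:
        """Record a request and return True, or return False if the budget is spent."""
        now = self._clock()
        # Drop timestamps that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self._window:
            self._stamps.popleft()
        if len(self._stamps) >= self._max:
            return False
        self._stamps.append(now)
        return True
```

A caller would check `try_acquire()` before each lookup and skip or delay the request when it returns False.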


The interface should be similar to Forvo's in GoldenDict's dictionary settings. Users do need to supply a language code, or the API request will time out.

@shenlebantongying
Collaborator Author

shenlebantongying commented Dec 21, 2022

Their data can already be used, though not out of the box: users can download the dataset from https://lingualibre.org/datasets/ and put it under the sound folder.

https://lingualibre.org/wiki/LinguaLibre:Chat_room/Archives/2021#.22How_to_use_Lingua_Libre_for_your_language_learning.22


@Exponent4806

The @lingua-libre French pronunciations are very comprehensive (>200,000), and the project will keep growing.

It is a good idea to add GoldenDict support for that wonderful project!

@shenlebantongying
Collaborator Author

Can I implement this in GD/src/dictionary/lingualibre.cc or GD/dictionary/lingualibre.cc rather than directly under the root? @xiaoyifang

The practice of putting everything in the root folder is insane. I don't know why the original author considered /src superfluous (goldendict/goldendict@ab88fa4). The project was probably much simpler at that time.

I think we will reorganize the source files in the future for better maintainability, so I prefer to put new code in its own place, in a modular way. If we actually do this, some header changes are inevitable. We could also run https://include-what-you-use.org/ over the codebase for faster build times.

@xiaoyifang
Owner

> Can I implement this in GD/src/dictionary/lingualibre.cc or GD/dictionary/lingualibre.cc rather than directly under the root? @xiaoyifang

yes, that's nice

> I don't know why the original author considered /src superfluous

I think it's because the original code was migrated from Subversion, which uses src as the default folder.
