Bad disambiguation of term "Maroc" in French #160

Open
oterrier opened this issue Jan 26, 2024 · 5 comments

@oterrier

oterrier commented Jan 26, 2024

I cannot find any example where "Maroc" is disambiguated as the country (Q1028).

For example, with this query:

{
    "text": "Au Maroc, arrestation de trois membres présumés du groupe État islamique...\nTrois Marocains affiliés au groupe djihadiste État islamique ont été arrêtés hier selon la police. Ils sont soupçonnés d’avoir  assassiné un policier  dont  le corps calciné a été retrouvé début mars près de Casablanca.",
    "shortText": "",
    "termVector": [],
    "language": {
        "lang": "fr"
    },
    "entities": [],
    "mentions": [
        "wikipedia"
    ],
    "nbest": false,
    "sentence": false,
    "minSelectorScore": 0.2
}
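
To make the report easier to reproduce, the query above can be sent from a short script. This is only a minimal sketch, assuming an entity-fishing server reachable at `http://localhost:8090`, exposing `/service/disambiguate` and accepting the JSON in a multipart form field named `query`. `build_query` and `send_query` are hypothetical helper names, and `send_query` relies on the third-party `requests` package.

```python
import json

# Assumed local endpoint; adjust host and port to your deployment.
DISAMBIGUATE_URL = "http://localhost:8090/service/disambiguate"


def build_query(text, lang="fr", min_selector_score=0.2, max_term_frequency=None):
    """Build a query dict with the same fields as the request in this issue."""
    query = {
        "text": text,
        "shortText": "",
        "termVector": [],
        "language": {"lang": lang},
        "entities": [],
        "mentions": ["wikipedia"],
        "nbest": False,
        "sentence": False,
        "minSelectorScore": min_selector_score,
    }
    # maxTermFrequency is only sent when the caller sets it explicitly.
    if max_term_frequency is not None:
        query["maxTermFrequency"] = max_term_frequency
    return query


def send_query(query, url=DISAMBIGUATE_URL):
    """POST the query as a multipart form field named 'query' (assumption)."""
    import requests  # third-party; imported lazily so build_query works without it

    response = requests.post(url, files={"query": (None, json.dumps(query))})
    response.raise_for_status()
    return response.json()
```

Calling `send_query(build_query("Au Maroc, ..."))` and reading the `wikidataId` of each item in the returned `entities` list should show which concept was selected for "Maroc".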

It is disambiguated as French protectorate in Morocco (Q907234), and at other times as Morocco national football team (Q207337).

It is never disambiguated as Morocco (Q1028), even though that is the concept with the highest conditional probability (0.903404988057546).

I can't explain why. Any clue?

Thx
Olivier

@kairntech

kairntech commented May 23, 2024

Some more recent tests in French.

In French, this is what comes out for country names against Wikidata:
Allemagne: disambiguated as Empire allemand, Équipe d'Allemagne de football
Grèce: disambiguated as Grèce antique
Roumanie: disambiguated as Royaume de Roumanie

This happens regardless of the value set for maxTermFrequency.

@kairntech

request

{
    "text": "Fabrication d'un violoncelle dans un atelier de lutherie à Reghin, en Roumanie, le 22 janvier 2021.",
    "shortText": "",
    "termVector": [],
    "language": {
        "lang": "fr"
    },
    "entities": [],
    "mentions": [
        "wikipedia"
    ],
    "nbest": false,
    "sentence": false,
    "minSelectorScore": 0.2,
    "maxTermFrequency": 5
}

response

{
    "software": "entity-fishing",
    "version": "0.0.6",
    "date": "2024-05-23T14:31:45.359208132Z",
    "runtime": 31,
    "nbest": false,
    "text": "Fabrication d'un violoncelle dans un atelier de lutherie à Reghin, en Roumanie, le 22 janvier 2021.",
    "language": {
        "lang": "fr",
        "conf": 1
    },
    "global_categories": [
        {
            "weight": 0.14285714285714288,
            "source": "wikipedia-fr",
            "category": "Instrument de musique classique",
            "page_id": 199859
        },
        {
            "weight": 0.14285714285714288,
            "source": "wikipedia-fr",
            "category": "Violoncelle",
            "page_id": 986894
        },
        {
            "weight": 0.14285714285714288,
            "source": "wikipedia-fr",
            "category": "Municipalité dans le județ de Mureș",
            "page_id": 11926951
        },
        {
            "weight": 0.14285714285714288,
            "source": "wikipedia-fr",
            "category": "Instrument à cordes frottées",
            "page_id": 317874
        },
        {
            "weight": 0.14285714285714288,
            "source": "wikipedia-fr",
            "category": "Royaume de Roumanie",
            "page_id": 8183397
        },
        {
            "weight": 0.14285714285714288,
            "source": "wikipedia-fr",
            "category": "Page contenant une partition",
            "page_id": 13964105
        },
        {
            "weight": 0.14285714285714288,
            "source": "wikipedia-fr",
            "category": "Lutherie",
            "page_id": 1310062
        }
    ],
    "entities": [
        {
            "rawName": "violoncelle",
            "offsetStart": 17,
            "offsetEnd": 28,
            "confidence_score": 0.551,
            "wikipediaExternalRef": 10822,
            "wikidataId": "Q8371",
            "domains": [
                "Acoustics",
                "Artisanship"
            ]
        },
        {
            "rawName": "atelier de lutherie",
            "offsetStart": 37,
            "offsetEnd": 56,
            "confidence_score": 0.4053,
            "wikipediaExternalRef": 167295,
            "wikidataId": "Q3267878"
        },
        {
            "rawName": "Reghin",
            "offsetStart": 59,
            "offsetEnd": 65,
            "confidence_score": 0.8624,
            "wikipediaExternalRef": 3813284,
            "wikidataId": "Q572478",
            "domains": [
                "Geography",
                "Architecture"
            ]
        },
        {
            "rawName": "Roumanie",
            "offsetStart": 70,
            "offsetEnd": 78,
            "confidence_score": 0.6214,
            "wikipediaExternalRef": 1387867,
            "wikidataId": "Q203493",
            "domains": [
                "Military"
            ]
        },
        {
            "rawName": "22 janvier",
            "offsetStart": 83,
            "offsetEnd": 93,
            "confidence_score": 0.8398,
            "wikipediaExternalRef": 3688,
            "wikidataId": "Q2275",
            "domains": [
                "Geology",
                "Oceanography",
                "Earth"
            ]
        }
    ]
}
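
The mismatch can be checked mechanically by comparing the `wikidataId` returned for each mention against the ID one expects. A minimal sketch, assuming the response shape shown above: `find_mismatches` is a hypothetical helper, and the expected ID Q218 for the modern country Romania is my own assumption about the correct target, not something taken from the entity-fishing output.

```python
def find_mismatches(response, expected):
    """Return (rawName, expected_id, actual_id) for mentions whose wikidataId differs.

    `expected` maps a mention's rawName to the Wikidata ID the caller expects.
    Mentions not listed in `expected` are ignored.
    """
    mismatches = []
    for entity in response.get("entities", []):
        name = entity.get("rawName")
        if name in expected and entity.get("wikidataId") != expected[name]:
            mismatches.append((name, expected[name], entity.get("wikidataId")))
    return mismatches


# Trimmed copy of the response above (only the fields the check needs).
sample_response = {
    "entities": [
        {"rawName": "Reghin", "wikidataId": "Q572478"},
        {"rawName": "Roumanie", "wikidataId": "Q203493"},
    ]
}

# Assumption: the modern country Romania is Q218 on Wikidata.
expected_ids = {"Reghin": "Q572478", "Roumanie": "Q218"}

print(find_mismatches(sample_response, expected_ids))
# → [('Roumanie', 'Q218', 'Q203493')]
```

Running the same check over a list of country names (Allemagne, Grèce, Maroc) would give a small regression set for this issue.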

@kermitt2 kermitt2 self-assigned this May 28, 2024
@kermitt2
Owner

Sorry for the late reply. This is weird indeed; I'll try to see what is happening in the disambiguation process for these countries.

@kde-kairntech

Here is a file containing notable errors with disambiguation in French.
ef_erreurs.pdf
Thanks

@lfoppiano
Collaborator

Thanks
