Slow calculation of ambiguity feature in MLLM #822

Closed
RietdorfC opened this issue Dec 16, 2024 · 6 comments · Fixed by #825
@RietdorfC

Dear Osma, dear Annif team,

As discussed in the annif-users group (https://groups.google.com/g/annif-users/c/8d3AL4LAzBQ), I have added the debugging lines and performed the suggest operation with an MLLM model trained on the full GND vocabulary set we use (1.4M subjects), using a document with a long processing time (305.72 sec.). Please find the zipped tsets.jsonl file attached to this issue.

Best regards
Clemens

tsets.zip

@osma
Member

osma commented Dec 16, 2024

Thanks @RietdorfC !

The tsets.jsonl file is quite revealing: you have some matches with extreme repetition, especially token id 194284. I'm not sure what it is without access to the model internals, but it seems to be some word that matches a lot of different GND subjects (2628, to be exact). It could be a common name like "Smith": if GND contains many names like "Smith, A." and "Smith, B." (even as altLabels), the analyzer in Annif will likely discard the initials, because they are too short to be considered words. MLLM then sees a lot of concepts that all have the same label "smith", and each of them becomes a potential match every time the word "Smith" appears in the document text.
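To make the "Smith" effect concrete, here is a small sketch. This is not Annif's actual analyzer code, and the 3-character minimum token length is an assumption for illustration:

```python
# Illustration only: a toy analyzer that lowercases, splits on non-word
# characters and drops short tokens, roughly analogous to what Annif's
# analyzers do. MIN_TOKEN_LENGTH is an assumed threshold for this sketch.
import re

MIN_TOKEN_LENGTH = 3

def tokenize(label: str) -> list[str]:
    tokens = re.findall(r"\w+", label.lower())
    return [t for t in tokens if len(t) >= MIN_TOKEN_LENGTH]

labels = ["Smith, A.", "Smith, B.", "Smith, C. J."]
print([tokenize(label) for label in labels])
# [['smith'], ['smith'], ['smith']]
# Three distinct GND subjects are now indistinguishable: every "Smith" in a
# document becomes a candidate match for all of them at once.
```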

I'll see if anything can be done to speed up the slow ambiguity calculation, but the slowness is also a symptom of matching having gone wrong in other ways.
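To illustrate where the time can go in this kind of calculation, here is a deliberately simplified sketch. It models each candidate match as the set of document token positions it covers, and defines its ambiguity as the number of other candidates covering overlapping positions; this is my own toy model, not the actual feature definition in Annif or the code in PR #825. The grouped variant is exact in the degenerate case described above, where thousands of candidates cover exactly the same tokens:

```python
# Toy model of an "ambiguity" feature: each match is a frozenset of the
# document token positions it covers (an assumption for this sketch).
from collections import Counter

def ambiguity_naive(matches: list[frozenset[int]]) -> list[int]:
    # O(n^2) pairwise overlap checks: with 2628 concepts matching the same
    # token, every occurrence of that token multiplies the work.
    return [
        sum(1 for j, other in enumerate(matches) if j != i and m & other)
        for i, m in enumerate(matches)
    ]

def ambiguity_grouped(matches: list[frozenset[int]]) -> list[int]:
    # O(n): matches covering identical position sets collapse into a single
    # counter key; exact whenever overlapping matches cover the same tokens.
    counts = Counter(matches)
    return [counts[m] - 1 for m in matches]

# 300 concepts that all match the same two token positions, plus one outlier;
# raise 300 towards 2628 to feel the quadratic blow-up of the naive version.
matches = [frozenset({5, 17})] * 300 + [frozenset({42})]
assert ambiguity_naive(matches) == ambiguity_grouped(matches)
```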

@osma
Member

osma commented Dec 20, 2024

Hi @RietdorfC , I've now implemented a new, hopefully much faster method for calculating the ambiguity feature in PR #825. Could you please test the code in that branch? I'm especially interested in

  1. Does the code run in your environment?
  2. Does it reduce the train and suggest time for MLLM?
  3. Does it achieve the same level of quality?

@RietdorfC
Author

Hi @osma,
Thank you for your quick and helpful reply, and for implementing the new method!

I have found the token that was responsible for the large number of matches (and the corresponding matches). We will investigate this issue further.

I will test your new method and report back to you as soon as possible.

Best regards
Clemens

@RietdorfC
Author

Hi @osma,
We tested your new method for calculating the ambiguity feature with MLLM models trained with the full GND vocabulary set we use (1.4M subjects). To answer your questions:

  1. We were able to run the code in our environment without any problems.
  2. The new method significantly reduces the time it takes MLLM to process our documents! A model using the new method processed a test corpus of 8,610 full-text documents (with an upper limit of 50K words) about 50 per cent faster than a model using the old method. The same model took 14 seconds to process the problematic document mentioned above, which had taken 306 seconds with the old method.
  3. To address the question of quality, we evaluated an MLLM model with the original calculation of the ambiguity feature (MLLM org) and an MLLM model with the new calculation (MLLM new), based on a total of 23,676 full-text documents (also with an upper limit of 50K words). With a limit of 5 and a threshold of 0.05, the two models achieve the following results:
| model    | precision | recall   | F1-score | NDCG     | n_i   | n_m   |
|----------|-----------|----------|----------|----------|-------|-------|
| MLLM org | 0.229807  | 0.384864 | 0.262386 | 0.380644 | 23676 | 23605 |
| MLLM new | 0.23069   | 0.387657 | 0.263582 | 0.381777 | 23676 | 23596 |

Thus, we conclude that the new method achieves the same level of quality, although we cannot precisely determine potential differences in the results, as each MLLM training run leads to slightly different results anyway.
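For readers comparing such tables, here is a minimal sketch of how per-document precision, recall and F1 can be computed and macro-averaged. The gold and suggested subject sets below are invented; in practice Annif's `annif eval` command produces these scores, along with NDCG:

```python
# Generic illustration of precision/recall/F1 per document, macro-averaged.
def prf(gold: set[str], suggested: set[str]) -> tuple[float, float, float]:
    hits = len(gold & suggested)
    precision = hits / len(suggested) if suggested else 0.0
    recall = hits / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Made-up (gold, suggested) subject sets for two documents.
docs = [
    ({"gnd:1", "gnd:2", "gnd:3"}, {"gnd:1", "gnd:4"}),
    ({"gnd:5"}, {"gnd:5", "gnd:6"}),
]
scores = [prf(gold, sugg) for gold, sugg in docs]
for name, vals in zip(("precision", "recall", "F1-score"), zip(*scores)):
    print(name, sum(vals) / len(vals))
```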

Best regards
Clemens

@osma
Member

osma commented Jan 29, 2025

@RietdorfC Excellent! Thanks a lot for the testing and the detailed results!

I will do one more round of verifying that everything is OK, then I think we can merge the PR so that it goes into the next release of Annif.

@RietdorfC
Author

@osma You are welcome and thank you very much for implementing the new method. It will help us a lot.
