Slow calculation of ambiguity feature in MLLM #822

Closed
RietdorfC opened this issue Dec 16, 2024 · 6 comments · Fixed by #825
@RietdorfC

Dear Osma, dear Annif team,

As discussed in the annif-users group (https://groups.google.com/g/annif-users/c/8d3AL4LAzBQ), I have added the debugging lines and performed the suggest operation with an MLLM model trained on the full GND vocabulary set we use (1.4M subjects), using a document with a long processing time (305.72 sec.). Please find the zipped tsets.jsonl file attached to this issue.

Best regards
Clemens

tsets.zip

@osma
Member

osma commented Dec 16, 2024

Thanks @RietdorfC !

The tsets.jsonl file is quite revealing: you have some matches with extreme repetition, especially token id 194284. I'm not sure what it is without access to the model internals, but it seems to be some word that matches a lot of different GND subjects (2628, to be exact). It could be a common name like "Smith": if GND contains many names like "Smith, A." and "Smith, B." (even as altLabels), the analyzer in Annif will likely discard the initials, because they are too short to be considered words. MLLM then sees a lot of concepts that all have the same label "smith", and each of them becomes a potential match every time the word "Smith" appears in the document text.
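To make the "Smith" effect concrete, here is a small sketch. This is not Annif's actual analyzer code, and the 3-character minimum token length is an assumption for illustration:

```python
# Illustration only: a toy analyzer that lowercases, splits on non-word
# characters and drops short tokens, roughly analogous to what Annif's
# analyzers do. MIN_TOKEN_LENGTH is an assumed threshold for this sketch.
import re

MIN_TOKEN_LENGTH = 3

def tokenize(label: str) -> list[str]:
    tokens = re.findall(r"\w+", label.lower())
    return [t for t in tokens if len(t) >= MIN_TOKEN_LENGTH]

labels = ["Smith, A.", "Smith, B.", "Smith, C. J."]
print([tokenize(label) for label in labels])
# [['smith'], ['smith'], ['smith']]
# Three distinct GND subjects are now indistinguishable: every "Smith" in a
# document becomes a candidate match for all of them at once.
```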

I'll see if anything can be done to speed up the slow ambiguity calculation, but the slowness is also a symptom of matching having gone wrong in other ways.
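To illustrate where the time can go in this kind of calculation, here is a deliberately simplified sketch. It models each candidate match as the set of document token positions it covers, and defines its ambiguity as the number of other candidates covering overlapping positions; this is my own toy model, not the actual feature definition in Annif or the code in PR #825. The grouped variant is exact in the degenerate case described above, where thousands of candidates cover exactly the same tokens:

```python
# Toy model of an "ambiguity" feature: each match is a frozenset of the
# document token positions it covers (an assumption for this sketch).
from collections import Counter

def ambiguity_naive(matches: list[frozenset[int]]) -> list[int]:
    # O(n^2) pairwise overlap checks: with 2628 concepts matching the same
    # token, every occurrence of that token multiplies the work.
    return [
        sum(1 for j, other in enumerate(matches) if j != i and m & other)
        for i, m in enumerate(matches)
    ]

def ambiguity_grouped(matches: list[frozenset[int]]) -> list[int]:
    # O(n): matches covering identical position sets collapse into a single
    # counter key; exact whenever overlapping matches cover the same tokens.
    counts = Counter(matches)
    return [counts[m] - 1 for m in matches]

# 300 concepts that all match the same two token positions, plus one outlier;
# raise 300 towards 2628 to feel the quadratic blow-up of the naive version.
matches = [frozenset({5, 17})] * 300 + [frozenset({42})]
assert ambiguity_naive(matches) == ambiguity_grouped(matches)
```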

@osma
Member

osma commented Dec 20, 2024

Hi @RietdorfC , I've now implemented a new, hopefully much faster method for calculating the ambiguity feature in PR #825. Could you please test the code in that branch? I'm especially interested in

  1. Does the code run in your environment?
  2. Does it reduce the train and suggest time for MLLM?
  3. Does it achieve the same level of quality?

@RietdorfC
Author

Hi @osma,
Thank you for your quick and helpful reply, and for implementing the new method!

I have found the token that was responsible for the large number of matches (and the corresponding matches). We will investigate this issue further.

I will test your new method and report back to you as soon as possible.

Best regards
Clemens

@RietdorfC
Author

Hi @osma,
We tested your new method for calculating the ambiguity feature with MLLM models trained with the full GND vocabulary set we use (1.4M subjects). To answer your questions:

  1. We were able to run the code in our environment without any problems.
  2. The new method significantly reduces the time it takes MLLM to process our documents! A model using the new method processed a test corpus of 8,610 full-text documents (with an upper limit of 50K words) about 50 per cent faster than a model using the old method. The same model took 14 seconds to process the problematic document mentioned above, which had taken 306 seconds with the old method.
  3. To address the question of quality, we evaluated an MLLM model with the original calculation of the ambiguity feature (MLLM org) and an MLLM model with the new calculation (MLLM new), based on a total of 23,676 full-text documents (also with an upper limit of 50K words). With a limit of 5 and a threshold of 0.05, the two models achieve the following results:
| model    | precision | recall   | F1-score | NDCG     | n_i   | n_m   |
|----------|-----------|----------|----------|----------|-------|-------|
| MLLM org | 0.229807  | 0.384864 | 0.262386 | 0.380644 | 23676 | 23605 |
| MLLM new | 0.23069   | 0.387657 | 0.263582 | 0.381777 | 23676 | 23596 |

Thus, we conclude that the new method achieves the same level of quality, although we cannot precisely determine potential differences in the results, as each MLLM training run leads to slightly different results anyway.
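For readers comparing such tables, here is a minimal sketch of how per-document precision, recall and F1 can be computed and macro-averaged. The gold and suggested subject sets below are invented; in practice Annif's `annif eval` command produces these scores, along with NDCG:

```python
# Generic illustration of precision/recall/F1 per document, macro-averaged.
def prf(gold: set[str], suggested: set[str]) -> tuple[float, float, float]:
    hits = len(gold & suggested)
    precision = hits / len(suggested) if suggested else 0.0
    recall = hits / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Made-up (gold, suggested) subject sets for two documents.
docs = [
    ({"gnd:1", "gnd:2", "gnd:3"}, {"gnd:1", "gnd:4"}),
    ({"gnd:5"}, {"gnd:5", "gnd:6"}),
]
scores = [prf(gold, sugg) for gold, sugg in docs]
for name, vals in zip(("precision", "recall", "F1-score"), zip(*scores)):
    print(name, sum(vals) / len(vals))
```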

Best regards
Clemens

@osma
Member

osma commented Jan 29, 2025

@RietdorfC Excellent! Thanks a lot for the testing and the detailed results!

I will do one more round of verifying that everything is OK, then I think we can merge the PR so that it goes into the next release of Annif.

@RietdorfC
Author

@osma You are welcome and thank you very much for implementing the new method. It will help us a lot.
