Use subsets of vocabularies in ensembles #596
Comments
I'll try to give some answers, although I'm not very sure about the details right now.
It is possible to use such a setup for an ensemble (but not for a neural-network ensemble). I cannot say anything about its benefits and complications.
Yes, the vocabulary loaded into the ensemble project acts as an allow list. A warning is shown every time a source project feeds an unknown URI to the ensemble as a suggestion. I would be quite careful about implementing such a tweaked setup right now, because there is a (small) chance that it will stop working in the future if the inner operations of Annif change. Of course, experimenting with various vocabulary tweaks like the ones you have in mind can provide valuable insights, and we would like to hear whether or not they are successful. For general background on the subject (I think you have seen the discussion already), an actual feature for allow-/deny-listing subjects has been proposed in issue 538. Maybe that feature would be more suitable for your case. Please let us know if you have any suggestions or thoughts about the feature.
Whether this works or not depends a lot on which backends are involved. Let me explain the background a bit. Annif internally represents the results of a suggest operation using two alternative classes: VectorSuggestionResult and ListSuggestionResult. The first one uses a fixed vector representation, basically a long string of numbers whose length is the size of the vocabulary. The second one instead represents only the top K suggestions as a list which includes the URI and score. Different backends use different representations depending on which one is the most convenient to produce and consume. They can be converted to each other, although it takes some computation. The vector representation cannot cope with the situation above, where V1 and V2 are subsets of V. So any backend that uses this (including the NN ensemble) will not work. But if you can avoid that, then it probably works, although this was not really something that Annif was originally designed for. Anyway, I think supporting this more generally - making it possible to use different flavors of a vocabulary in an ensemble and its source backends - would be a nice goal which shouldn't be too hard to implement. It may be enough to adjust the NN ensemble a little bit to fix the vector size mismatch. But more generally, ensembles should be prepared to accept suggestion results with a different vocabulary and map them by URI, regardless of the representation (vector or list).
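To make the vector size mismatch and the "map them by URI" idea concrete, here is a minimal sketch in plain Python. It does not use Annif's actual classes or API: the helper names (`build_index`, `list_to_vector`, `remap_vector`, `average`), the example.org URIs, and the simple averaging step are all assumptions made for illustration only.

```python
from typing import Dict, List, Tuple

Suggestion = Tuple[str, float]  # (concept URI, score); hypothetical type alias


def build_index(vocab_uris: List[str]) -> Dict[str, int]:
    """Map each concept URI to its position in a vocabulary-sized vector."""
    return {uri: pos for pos, uri in enumerate(vocab_uris)}


def list_to_vector(suggestions: List[Suggestion], index: Dict[str, int]):
    """Project (URI, score) pairs onto the target vocabulary.

    URIs missing from the target vocabulary are collected in `dropped`,
    so the target vocabulary effectively acts as an allow list.
    """
    vector = [0.0] * len(index)
    dropped = []
    for uri, score in suggestions:
        pos = index.get(uri)
        if pos is None:
            dropped.append(uri)
        else:
            vector[pos] = score
    return vector, dropped


def remap_vector(vector: List[float], source_uris: List[str], index: Dict[str, int]):
    """Re-express a vector over a source vocabulary in the target vocabulary."""
    pairs = [(uri, score) for uri, score in zip(source_uris, vector) if score > 0.0]
    return list_to_vector(pairs, index)


def average(vectors: List[List[float]]) -> List[float]:
    """Element-wise mean; only meaningful when all vectors have equal length."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Hypothetical vocabularies: V1 and V2 are subsets of V with stable URIs.
V = [f"http://example.org/c{i}" for i in range(1, 5)]
V1 = [V[0], V[2]]
V2 = [V[1], V[2], V[3]]

v1_scores = [0.8, 0.4]  # vector over V1: length 2, not len(V) == 4
v2_list = [(V[2], 0.6), (V[3], 0.2), ("http://example.org/c9", 0.3)]

index = build_index(V)
vec1, dropped1 = remap_vector(v1_scores, V1, index)
vec2, dropped2 = list_to_vector(v2_list, index)
combined = average([vec1, vec2])
```

Averaging `v1_scores` directly with a vector over V would pair up scores belonging to different concepts (or fail on the length difference), which is the mismatch described above. After remapping by URI, both results have length `len(V)` and line up concept by concept; the URI that is unknown to V ends up in `dropped2` rather than in the combined result.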
@osma, @juhoinkinen Thank you for the detailed explanation. It's great that you are considering implementing this! Please let me know if you need help testing this change.
Commenting here with our use case (Finto AI) in mind: YSO-places is a subset of YSO (when used in the vocabulary of Finto AI YSO projects), and there could be a specialized model for suggesting only concepts from YSO-places. There could even be a specialized backend for this; the idea came from the upcoming special issue "Geographic Information Extraction from Texts" of the Information Processing & Management journal.
Thanks @juhoinkinen for sharing the interesting use case on the topic!
I have a question about the use of vocabularies in an ensemble. Given a vocabulary V which is used in an ensemble, and two vocabularies V1 and V2 which are used by different backends of the ensemble (e.g. omikuji and mllm): V1 and V2 are subsets of V, with different subsets of the concepts and a different set of labels (gold standard (TSV) and TTL). The concept URIs are stable in all versions of V. Is it possible to use such a setup?
Background: We want to aggressively tweak the vocabulary (reduce concepts and manipulate labels) for the mllm backend, to improve the results.