-
Notifications
You must be signed in to change notification settings - Fork 82
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Move Vocab & Embedding setup code into readers #337
Comments
This is also useful for writing more fine-grained tests that check things during training I quickly turned your code here into a test: https://github.com/uclmr/jack/blob/master/tests/jack/readers/test_fastqa_loop.py |
I think vocabs and emebddings could simply be created and loaded in the shared resources and not the modules. In the modules it would introduce unnecessary overhead for most implementations, because embedding handling is quite reader agnostic. |
I'd disagree. I have already seen readers that a) need or not need character embeddings, b) several vocabs for different parts (questions, answers, support), c) only use pre-trained embeddings etc. All of these seem like reasonable decisions that are reader-specific (or input module specific) This doesn't stop us from factoring out a lot of functionality, such that readers/input modules only use a single call to some |
I agree with the vocab part, but the embeddings can be provided separately already through the shared resources. |
I'd prefer if the reader was in charge of that too. This makes setting up readers more straightforward for the user. Right now I have to think about what I need to prepare (the embeddings), and what the reader is doing to populate the shared resources. And I need to look into the reader to decide whether it needs embeddings or not (if it doesn't, I shouldn't load embeddings). I would like the API usage to be
I also feel that even embedding loading can get convoluted (for example, if you decide to preload the embeddings for the training set, or if you want to load other types of pre-trained models etc). |
I don't have a strong opinion here and understand your point. It makes it also clear to the user where these resources are coming from when they are not auto-magically created somewhere else first. We had exactly this problem before when people did not understand where embeddings and vocabs actually come from and how they are created. Anyway, this is a major change so we should push that rather sooner than later. |
@riedelcastro @pminervini , anyone working on this? |
vocab and embeddings are now decoupled. The setup code should still move into the readers themselves though. |
I wrote some code that uses a jack reader (its model, pre and post-processing), but an own training loop---see below. This should be useful for others as a starting point for, say, using multi-task learning objectives etc. Personally, I think it's a good idea to think of this as the core way of interfacing jack. Everything else (a CLI training script etc) should be light wrapper around this.
I think we should work on streamlining this. What I dislike about the above is the handling of vocabs and embeddings. This is reader-specific, and should be handled internally. So I'd rather it looked like:
and then the input module (or model module) takes care of setting up vocabs and embeddings. Because don't have this separation, right now, vocab and embedding code creeps into learning code, even though it's specific to readers. The above would fix this.
To fix this, the reader-owners need to move embedding/vocab setup code into their modules.
Any thoughts or objections?
The text was updated successfully, but these errors were encountered: