
overlapping sentences with long texts exceeding max_token_per_batch #37

Open
davidberenstein1957 opened this issue Jun 19, 2023 · 6 comments


@davidberenstein1957

Hi,

I work a lot with coreference on longer texts, and I think it would be a nice addition to overlap sentences so that the model is more robust w.r.t. longer texts. I would also like to work on this.

Regards,
David

@shon-otmazgin
Owner

Hello @davidberenstein1957,

Do you mean overlapping sentences so that there is more attention between segments? If so, recent work (I think the paper using BERT for coreference) showed it is not necessary, and it also comes with more computation time.

@davidberenstein1957
Author

No, I mean overlap for the case where your entire text might not fit into (GPU) memory.

@shon-otmazgin
Owner

Can you share more details? If you set max_tokens_in_batch to the longest doc in the dataset, does it still go OOM?

@davidberenstein1957
Author

davidberenstein1957 commented Jun 19, 2023

Similarly, when a text exceeds the max_tokens limit of the transformer used, it could still be interesting to take the last x sentences of one chunk and prepend them to the next chunk, so that some context carries over between batches, and then merge the resulting clusters afterwards if they contain the same spans.
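
Roughly, the chunking step could look like this (a minimal sketch with made-up names and a crude word-count tokenizer, not the fastcoref API):

```python
# Hypothetical sketch: split a document (list of sentences) into chunks that
# respect a token budget, carrying the last `overlap` sentences of each chunk
# over into the next one. For illustration only; not part of fastcoref.

def chunk_with_overlap(token_counts, max_tokens, overlap=2):
    """Return lists of sentence indices; consecutive chunks share `overlap` sentences."""
    chunks, current, current_tokens = [], [], 0
    for i, n_tokens in enumerate(token_counts):
        if current and current_tokens + n_tokens > max_tokens:
            chunks.append(current)
            # start the next chunk with the last `overlap` sentences of this one
            current = current[-overlap:]
            current_tokens = sum(token_counts[j] for j in current)
        current.append(i)
        current_tokens += n_tokens
    if current:
        chunks.append(current)
    return chunks


sentences = ["Sent one.", "Sent two.", "Sent three.", "Sent four.", "Sent five."]
token_counts = [len(s.split()) for s in sentences]  # crude stand-in for a real tokenizer
for chunk in chunk_with_overlap(token_counts, max_tokens=5, overlap=1):
    print([sentences[j] for j in chunk])
```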

@shon-otmazgin
Owner

If I understand correctly, you want to overlap between batches? I don't understand the benefit of it.

@davidberenstein1957
Author

Let's say you have a text of length 3x and the maximum number of tokens in a single pass is 2x. Then it might make sense to pass the text as segments 1:2 and 2:3, and afterwards re-align/merge the coref clusters based on the overlapping sentences in segment 2.
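
The merge step could then be something like this (again just a hypothetical sketch; the cluster format and span offsets are made up for illustration):

```python
# Hypothetical sketch: clusters from two overlapping runs are unified whenever
# they share a mention span (offsets in the full text). Not part of fastcoref.

def merge_clusters(clusters_a, clusters_b):
    """Merge two lists of clusters (collections of (start, end) spans) that share spans."""
    merged = [set(c) for c in clusters_a]
    for cluster in clusters_b:
        cluster = set(cluster)
        hits = [m for m in merged if m & cluster]  # existing clusters sharing a span
        for m in hits:
            cluster |= m
            merged.remove(m)
        merged.append(cluster)
    return merged


# Segments 1:2 and 2:3 both cover the middle segment, so spans there overlap.
clusters_seg12 = [[(0, 5), (40, 45)], [(10, 14)]]
clusters_seg23 = [[(40, 45), (80, 85)]]  # (40, 45) links the two runs
print([sorted(c) for c in merge_clusters(clusters_seg12, clusters_seg23)])
```

Any cluster from the second run that shares at least one span with a cluster from the first run is unified with it; everything else is kept as a separate cluster.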
