Hello!
According to the paper Analysing The Impact of Sequence Composition on Language Model Pre-Training, instead of letting the model learn document boundaries from EOS/EOT tokens, it appears to be better to prevent attention from crossing document/sequence boundaries (i.e., masking out attention to tokens from other packed documents). This was also the approach used for the Llama 3 models.
I wonder whether this idea has already been explored for this repo?
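For reference, here is a minimal sketch of what such intra-document masking could look like in PyTorch. This is only an illustration, not code from this repo: the function name `intra_document_causal_mask` and the assumption that each token in a packed sequence carries a document id are mine.

```python
import torch
import torch.nn.functional as F

def intra_document_causal_mask(doc_ids: torch.Tensor) -> torch.Tensor:
    """Boolean (seq_len, seq_len) mask that is causal AND blocks attention
    across document boundaries.

    doc_ids: (seq_len,) integer tensor where tokens of the same packed
    document share an id, e.g. [0, 0, 0, 1, 1, 2, 2, 2].
    True means "attention allowed".
    """
    seq_len = doc_ids.shape[0]
    # Standard causal mask: query i may attend to keys j <= i.
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool, device=doc_ids.device))
    # Same-document mask: query and key must belong to the same document.
    same_doc = doc_ids.unsqueeze(0) == doc_ids.unsqueeze(1)
    return causal & same_doc

if __name__ == "__main__":
    # Three packed documents of lengths 3, 2, 3 in one training sequence.
    doc_ids = torch.tensor([0, 0, 0, 1, 1, 2, 2, 2])
    mask = intra_document_causal_mask(doc_ids)
    # scaled_dot_product_attention accepts a boolean attn_mask where True = attend.
    q = k = v = torch.randn(1, 1, 8, 16)  # (batch, heads, seq, head_dim)
    out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
    print(mask.int())
```

With packing, this keeps the loss unchanged but ensures each token only sees context from its own document; whether that is worth the extra masking logic (or a block-sparse attention kernel) here is exactly the question.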