Hello, I want to ask: why does overlap in the dataset cause overfitting? Isn't it better that way, so that the language model learns from each word present and its contextual relationship to the preceding words?
Hi there,

that's a good question, and it's actually a bit of a tricky topic. But even with a stride > 1, the LLM sees every word in the text. It's just that it doesn't see each word multiple times.

E.g., consider the following example:

Input sentence:

"Hello world, this is an example of a batch input sequence."

With stride=6 (and a context length of 6 words), consecutive windows don't overlap:

Batch inputs:

"Hello world, this is an example"
"of a batch input sequence."

Batch targets (inputs shifted by +1):

"world, this is an example of"
"a batch input sequence."

With stride=1, each window starts just one word after the previous one, so consecutive windows overlap in 5 of their 6 words:

Batch inputs:

"Hello world, this is an example"
"world, this is an example of"
"this is an example of a"
...

Batch targets:

"world, this is an example of"
"this is an example of a"
"is an example of a batch"
...

So with stride=1, most words appear in many training windows, and the model trains on essentially the same (input, target) pairs over and over within a single epoch. That's similar to training for many epochs on duplicated data, which encourages memorization and hence overfitting.

Please let me know in case you have any follow-up questions.
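To make the stride effect concrete, here is a minimal sketch of how such (input, target) windows can be generated. The function name `create_windows` and the word-level "tokenization" via `split()` are illustrative assumptions, not from any specific library; real dataloaders operate on token IDs and usually drop the incomplete final window.

```python
# Illustrative sketch: build (input, target) windows with a configurable stride.
# `create_windows` is a hypothetical helper name, not a library function.
def create_windows(tokens, context_length, stride):
    inputs, targets = [], []
    # Slide a window of `context_length` tokens over the text, advancing by
    # `stride` tokens each step; the target is the input shifted by +1.
    # (For simplicity, trailing windows shorter than `context_length` are kept;
    # real implementations typically discard them.)
    for i in range(0, len(tokens) - 1, stride):
        inputs.append(tokens[i:i + context_length])
        targets.append(tokens[i + 1:i + context_length + 1])
    return inputs, targets

# Word-level "tokens" for illustration only; an LLM would use subword token IDs.
words = "Hello world, this is an example of a batch input sequence.".split()

# stride == context_length: windows do not overlap, each word seen once
x6, y6 = create_windows(words, context_length=6, stride=6)

# stride == 1: consecutive windows overlap in 5 of their 6 words
x1, y1 = create_windows(words, context_length=6, stride=1)

print(len(x6), len(x1))  # far more (heavily overlapping) windows with stride=1
print(x1[0])
print(x1[1])
```

With stride=1 the same word pairs recur across many windows, which is exactly the repeated exposure described above.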