"Scaling Data-Constrained Language Models" is a very nice paper, and I learn a lot from this paper.
However, I have a question about this paper:
In the abstract and Figure 1, it recommends we should train 4 epochs.
But Figure 3 shows that we should choose 59 epochs.
So my question is why the optimal epoch is not 4 epochs in Figure 3.
Thanks in advance.
This is because of immense diminishing returns. While you can still get a better loss by training for more than 4 epochs, the returns diminish sharply (Figure 5 / attached): at 59 epochs, you are spending a lot of compute for only a tiny extra reduction in loss.
Meanwhile, at 4 epochs the returns from repeated data are still very close to what you would get from new data, so your compute is well spent.
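To make the diminishing returns concrete, here is a minimal sketch (not the authors' code) of the effective-data idea as I understand it from the paper: each repetition of the unique data is worth exponentially less, governed by a fitted decay constant `R_D_STAR` (roughly 15 in the paper's fit; treat the exact value here as an assumption), via D' = U_D + U_D * R*_D * (1 - exp(-R_D / R*_D)) with R_D = epochs - 1.

```python
import math

# Sketch only, under the assumptions stated above (not the authors' code).
R_D_STAR = 15.4  # fitted repetition decay constant; treat the exact value as approximate

def effective_tokens(unique_tokens: float, epochs: float) -> float:
    """Effective unique-data equivalent of `epochs` passes over `unique_tokens`,
    per D' = U_D + U_D * R_D* * (1 - exp(-R_D / R_D*)), with R_D = epochs - 1."""
    repetitions = epochs - 1
    return unique_tokens + unique_tokens * R_D_STAR * (1 - math.exp(-repetitions / R_D_STAR))

U = 1.0  # one "unit" of unique data
for epochs in (1, 2, 4, 16, 59):
    d_eff = effective_tokens(U, epochs)
    # per-epoch value: how much effective data each pass contributes on average
    print(f"epochs={epochs:>3}  effective data={d_eff:5.2f}x unique  per-epoch value={d_eff/epochs:.2f}")
```

With these assumed numbers, an epoch at 4 total epochs is still worth about 0.9x fresh data, while at 59 epochs the average per-epoch value falls to roughly 0.3x, which is why 59 epochs can be loss-optimal for a fixed dataset yet a poor use of compute.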
"Scaling Data-Constrained Language Models" is a very nice paper, and I learn a lot from this paper.
However, I have a question about this paper:
In the abstract and Figure 1, it recommends we should train 4 epochs.
But Figure 3 shows that we should choose 59 epochs.
So my question is why the optimal epoch is not 4 epochs in Figure 3.
Thanks in advance.