Move loss generating token counting to the dataloader #1632

Merged: 19 commits merged into mosaicml:main on Nov 4, 2024

Conversation

@dakinggg (Collaborator) commented Nov 2, 2024

The previous PR to count loss-generating tokens caused a throughput regression because the token counting is on the critical path in the Composer trainer. This PR moves the token counting into the collator so that it can be hidden in the dataloader. It does this by counting tokens per row in the batch using the existing token counting function. That is not the most efficient approach, but it doesn't matter because these jobs are not dataloader bottlenecked, and it allows reusing the existing code.
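A minimal sketch of the idea (not the actual llm-foundry implementation): a wrapper around an existing collator that attaches per-row token counts to the batch, so the counting happens inside the dataloader workers rather than on the trainer's critical path. The class name, the batch keys (`total_tokens`, `loss_generating_tokens`), and the helper function are all illustrative assumptions.

```python
import torch


def count_loss_generating_tokens(labels: torch.Tensor, ignore_index: int = -100) -> int:
    """Count tokens that contribute to the loss, i.e. labels != ignore_index.

    Hypothetical helper standing in for the existing per-row token counting logic.
    """
    return int((labels != ignore_index).sum().item())


class TokenCountingCollatorWrapper:
    """Wraps a base collator and adds per-row token counts to the batch.

    Because the wrapper runs inside the dataloader worker, the cost of counting
    is hidden behind data loading instead of sitting on the trainer's critical path.
    """

    def __init__(self, base_collator, ignore_index: int = -100):
        self.base_collator = base_collator
        self.ignore_index = ignore_index

    def __call__(self, examples):
        batch = self.base_collator(examples)
        # Count per row with simple row-wise logic; not the most efficient,
        # but the dataloader is not the bottleneck here.
        batch['total_tokens'] = torch.tensor(
            [int(row.numel()) for row in batch['input_ids']]
        )
        batch['loss_generating_tokens'] = torch.tensor(
            [count_loss_generating_tokens(row, self.ignore_index)
             for row in batch['labels']]
        )
        return batch
```

The trainer can then sum the precomputed counts from the batch instead of recomputing them, which is what removes the regression for small models where the counting overhead was a meaningful fraction of step time.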

This PR should not change token counts or loss. It will barely affect throughput for larger models, but will have a significant impact for smaller models.

125m model pretraining throughput, loss, and token count, before and after this PR:
[Screenshots: three charts comparing throughput, loss, and token count before and after this PR]

Llama finetune throughput, loss, and token count, before and after this PR:

@dakinggg dakinggg changed the title WIP Move loss generating token counting to the dataloader Nov 2, 2024
@dakinggg dakinggg marked this pull request as ready for review November 3, 2024 00:01
@dakinggg dakinggg requested a review from a team as a code owner November 3, 2024 00:01
Review comments on llmfoundry/data/utils.py (resolved)
@dakinggg dakinggg merged commit ce302be into mosaicml:main Nov 4, 2024
9 checks passed