
Normalize TensorFlow training steps by batch size #8

Closed
wants to merge 1 commit

Conversation

lewtun
Member

@lewtun lewtun commented Feb 11, 2022

This PR addresses a comment on the forums, where a reader was confused why num_train_steps wasn't normalized by the batch size.

Following the official TensorFlow examples in transformers, I've included this normalization in all the TensorFlow sections of the course. The only sections where I didn't make the change are:

because in both those sections we have

num_train_steps = len(tf_train_dataset)

and I wasn't sure if this was intentional.

@lewtun lewtun requested a review from Rocketknight1 February 11, 2022 13:55
@Rocketknight1
Member

Hi @lewtun! The tf.data.Dataset objects here already have a batch() operation applied - they're more like torch DataLoader objects. After a batch operation, their len() is num_samples // batch_size already, so we shouldn't need to do that division twice.
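The point above can be sketched numerically. This is a minimal plain-Python illustration with hypothetical sample and batch counts (no TensorFlow required): after `tf.data.Dataset.batch(batch_size)`, `len()` counts batches rather than samples, so dividing by the batch size a second time undercounts the training steps.

```python
import math

# Hypothetical numbers for illustration.
num_samples = 1000
batch_size = 8
num_epochs = 3

# len(tf_train_dataset) after .batch(batch_size) with drop_remainder=False
# is ceil(num_samples / batch_size), i.e. it already counts batches.
len_batched_dataset = math.ceil(num_samples / batch_size)  # 125 batches

# Correct: the dataset length is already batches per epoch.
num_train_steps = len_batched_dataset * num_epochs  # 375

# The double division this PR would have introduced:
wrong_num_train_steps = (len_batched_dataset // batch_size) * num_epochs  # 45
```

With these numbers the double division shrinks the schedule from 375 steps to 45, which would make a learning-rate schedule decay far too quickly.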

@Rocketknight1
Member

That said, the comments are confusing because it isn't clear the division has already been performed. Maybe I should clarify them?

@lewtun
Member Author

lewtun commented Feb 11, 2022

Ah, thanks for the insight! I'll close this PR; it would be great if you could open a new one with a clarifying remark, because the comment in the code is indeed confusing for TF noobs (myself included!)

As a side note, how come the official examples normalize by batch size, e.g. here? In that example, we have

num_train_steps=int(training_args.num_train_epochs * train_batches_per_epoch)

where

train_batches_per_epoch = len(train_dataset) // total_train_batch_size

It was reading the official examples that convinced me there was an error in the course 😄

@lewtun lewtun closed this Feb 11, 2022
@lewtun lewtun deleted the fix-train-steps branch February 11, 2022 15:47
@Rocketknight1
Member

Agreed! In that example, train_dataset is the HuggingFace Dataset, so its len() is the number of samples. When we convert it to a TF tf.data.Dataset it becomes more like a DataLoader, so its len() is num_samples // batch_size. The confusion between the two types of Dataset object has been a problem for me, so I've tried to be clear and give the variables names like tf_train_dataset when I make the conversion, but I realize it's still quite opaque for people who aren't familiar.
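The distinction above can be sketched numerically in plain Python (hypothetical sample count and batch size; no TensorFlow or datasets imports needed): `len()` on a Hugging Face `Dataset` counts samples, while `len()` on the converted `tf.data.Dataset` counts batches.

```python
# Hypothetical numbers for illustration.
num_samples = 10_000           # stands in for len(train_dataset), the HF Dataset
total_train_batch_size = 32
num_train_epochs = 3

# Official-example style: normalize the *sample* count by the batch size.
train_batches_per_epoch = num_samples // total_train_batch_size  # 312
num_train_steps = int(num_train_epochs * train_batches_per_epoch)  # 936

# Course style: the converted tf.data.Dataset's length is already batches
# per epoch (exactly equal here, assuming drop_remainder=True), so no
# further division is needed.
len_tf_train_dataset = num_samples // total_train_batch_size
assert len_tf_train_dataset == train_batches_per_epoch
```

Both styles land on the same step count; the bug would only arise from mixing them, i.e. dividing a batch count by the batch size again.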

@lewtun
Member Author

lewtun commented Feb 11, 2022

Ah that makes more sense now!

CaterinaBi added a commit to CaterinaBi/hugging-face-course-translation that referenced this pull request Oct 16, 2022