chapter 6 `calc_loss_batch` and `calc_accuracy_loader` #434

zia-hasan · 2024-11-14T07:05:21Z

zia-hasan
Nov 14, 2024

I have a question about the implementation of and calc_loss_batch (applies to calc_accuracy_loader as well). I see the following implementation:

def calc_loss_batch(input_batch, target_batch, model, device):
    input_batch, target_batch = input_batch.to(device), target_batch.to(device)
    logits = model(input_batch)[:, -1, :]  # Logits of last output token
    loss = torch.nn.functional.cross_entropy(logits, target_batch)
    return loss

Could this be potentially be an issue since we're always taking the last token position (-1) regardless of the actual length of the input text. In the current implementation, we might be using the representation of a padding token for classification. Wouldn't something like this be more accurate?

def calc_loss_batch(input_batch, target_batch, model, device):
    input_batch, target_batch = input_batch.to(device), target_batch.to(device)
    
    # Find the last non-padding token for each sequence in the batch
    pad_token_id = 50256  # <|endoftext|> token used for padding
    mask = input_batch != pad_token_id
    last_token_pos = mask.sum(dim=1) - 1  # Get position of last real token
    
    # Get model outputs
    logits = model(input_batch)  # shape: [batch_size, seq_len, num_classes]
    
    # Select the logits corresponding to the last real token of each sequence
    batch_size = logits.size(0)
    selected_logits = logits[torch.arange(batch_size), last_token_pos]
    
    loss = torch.nn.functional.cross_entropy(selected_logits, target_batch)
    return loss

def calc_accuracy_loader(data_loader, model, device, num_batches=None):
    model.eval()
    correct_predictions, num_examples = 0, 0

    if num_batches is None:
        num_batches = len(data_loader)
    else:
        num_batches = min(num_batches, len(data_loader))
    for i, (input_batch, target_batch) in enumerate(data_loader):
        if i < num_batches:
            input_batch, target_batch = input_batch.to(device), target_batch.to(device)

            # Find the last non-padding token for each sequence in the batch
            pad_token_id = 50256  # <|endoftext|> token used for padding
            mask = input_batch != pad_token_id
            last_token_pos = mask.sum(dim=1) - 1  # Get position of last real token

            with torch.no_grad():
                logits = model(input_batch)  # Logits of last output token
                # Select the logits corresponding to the last real token of each sequence
                batch_size = logits.size(0)
                selected_logits = logits[torch.arange(batch_size), last_token_pos]
                predicted_labels = torch.argmax(selected_logits, dim=-1)

            num_examples += predicted_labels.shape[0]
            correct_predictions += (predicted_labels == target_batch).sum().item()
        else:
            break
    return correct_predictions / num_examples

When we are actually making predictions for one sequence, then we always use last token (non-padding). I see this as a mismatch between train and test time. Can someone shed some light on this?

rasbt · 2024-11-14T10:32:12Z

rasbt
Nov 14, 2024
Maintainer

That's a great observation, and I agree that this could be an issue. However, the data loader is set up such that it pads all sequences to equal length, even for the validation and test loaders:

val_dataset = SpamDataset(
    csv_file="validation.csv",
    max_length=train_dataset.max_length,  # <-------
    tokenizer=tokenizer 
)
test_dataset = SpamDataset(
    csv_file="test.csv",
    max_length=train_dataset.max_length,   # <-------
    tokenizer=tokenizer
)

So, the -1 token is always in the same position.

I've run some experiments without padding (see row 15 here: https://github.com/rasbt/LLMs-from-scratch/tree/main/ch06/02_bonus_additional-experiments) and yes, it can indeed perform better. (This is somewhat analogous to your suggestion.)

3 replies

zia-hasan Nov 14, 2024
Author

@rasbt Thanks for the response! The training/test/validation setup looks good. However, in Chapter 6’s notebook, in cell [41], I noticed that the classify_review function uses the last token for classification. This seems like it could introduce a slight distribution shift at inference time since, since most sequences during training are probably shorter than the max_length, the last token would likely be a padding token. However, I imagine it could still work, as causal attention might allow information to flow into the padding token.

Regarding your experiments, I noticed that with no_padding=True, only batch_size=1 can be used in the current code. But, with the trick I mentioned above, you could enable any batch_size while maintaining an equivalent formulation to no_padding. Probably will give similar higher accuracy as row 15 with more stable convergence with mini batch GD . Just an observation.

Let me know if this makes sense?

Quick Update. I ran a quick test. With original Implementation got this:

Training accuracy: 97.21%
Validation accuracy: 97.32%
Test accuracy: 95.67%

With the suggested change above got a bump in test accuracy by 1%.

Training accuracy: 99.23%
Validation accuracy: 97.32%
Test accuracy: 96.67%

rasbt Nov 14, 2024
Maintainer

Thanks for the follow-up, and these are good points!

I noticed that the classify_review function uses the last token for classification. This seems like it could introduce a slight distribution shift at inference time since

Yes and no. I agree that it could potentially be a problem, but that's why the function has a "max_length" parameter to pad the input accordingly:

print(classify_review(
      text_1, model, tokenizer, device,
      max_length=train_dataset.max_length  # <------
))

with more stable convergence with mini batch GD .

Note that my batch_size=1 implementation uses gradient accumulation, so the effective batch size should be the same. Or in other words, it should be mathematically equivalent (except for a tiny discrepancy due to the token number thing discussed here: https://unsloth.ai/blog/gradient). But yeah, your implementation should be faster for sure.

I added this as an additional experiment (row 16) and it's indeed the same accuracy as with batch size 1 and gradient accumulation. But as you can see, it's much faster 😊

	Model	Weights	Trainable token position	Trainable layers	Context length	Training acc	Validation acc	Test acc	Training time	CPU/GPU
14	gpt2-small (124M)	pretrained	last	last_block	variable: no padding (batch size 1)	100.00%	98.66%	98.00%	1.75 min	A100
15	gpt2-small (124M)	pretrained	last	last_block	variable: no padding (batch size 8)	99.33%	98.66%	98.33%	1.70 min	A100
16	gpt2-small (124M)	pretrained	last	last_block	flexible (last non-padding position)	99.42%	98.66%	98.33%	0.30 min	A100

Thanks for this very good discussion and suggestion! I added the code for this here: #438

zia-hasan Nov 14, 2024
Author

Awesome! missed the max_length parameter. Makes total sense. Thanks for running the experiments as well and also for writing such a clean and wonderful book!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

chapter 6 `calc_loss_batch` and `calc_accuracy_loader` #434

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 1 comment 3 replies

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Select a reply

chapter 6 calc_loss_batch and calc_accuracy_loader #434

zia-hasan Nov 14, 2024

Replies: 1 comment · 3 replies

rasbt Nov 14, 2024 Maintainer

zia-hasan Nov 14, 2024 Author

rasbt Nov 14, 2024 Maintainer

zia-hasan Nov 14, 2024 Author

chapter 6 `calc_loss_batch` and `calc_accuracy_loader` #434

zia-hasan
Nov 14, 2024

Replies: 1 comment 3 replies

rasbt
Nov 14, 2024
Maintainer

zia-hasan Nov 14, 2024
Author

rasbt Nov 14, 2024
Maintainer

zia-hasan Nov 14, 2024
Author