Organize our collation utils #1261

SalmanMohammadi · 2024-08-05T12:59:19Z

see #1005 for some context. From @ebsmothers (and @joecummings)@

I don't love our collate utilities rn. In an ideal world I want two simple abstractions: right_padded_collate and left_padded_collate, and any further padding built up on top of that. I don't think this version should exist at all (especially with its heavy dictionary usage)

RdoubleA · 2024-08-05T14:23:57Z

Can you get around using dictionaries if that's the expected return from the datasets and the expected input into models?

SalmanMohammadi · 2024-08-05T16:27:06Z

I guess the idea would be to have a general utility which applies a choice of padding function to a dict? The padding logic would be abstracted here

SalmanMohammadi · 2024-08-10T12:49:24Z

I'm going to bump this because I'd like to right-pad outputs from a dataset which are of the form:

{'tokens': [920,
  508,
  10110,
  13047,
  5786,
  1371,
  12566,
  592,
  2750,
  10110,
  278,
  615,
  ...],
 'labels': 1}

i.e. for a classification task. Similar to left_padded_collate I only want to right-pad the prompt tokens.

SalmanMohammadi · 2024-08-10T17:59:03Z

One proposal could be:

We define left_pad_sequence + right_pad_sequence with the signatures

def left_pad_sequence(List[torch.Tensor]) -> torch.Tensor:
    ...

def right_pad_sequence(List[torch.Tensor]) -> torch.Tensor:
    ...

Then, we have task-specific pad functionality:

def padded_collate_sft(
    batch: List[Dict[str, List[int]]],
    pad_fn: Callable[List[Any], torch.Tensor] = right_pad_sequence
    padding_idx: int = 0,
    ignore_idx: int = CROSS_ENTROPY_IGNORE_IDX
) -> Dict[str, torch.Tensor]:
    ...

def padded_collate_dpo(
    ...
):
    ...

i.e. largely the same but paramterize the pad sequence. This won't be exposed at the config level - it's just for recipe-writers and hackers.
Then, if we wanted to be really abusive, we could do something like:

from torchtune.., import right_padded_collate
from operator import itemgetter
from torch.utils.data import DataLoader
dl = DataLoader(ds, batch_size=4,
                collate_fn=lambda batch: 
                (right_padded_collate(map(itemgetter("tokens"), batch)), 
                 torch.Tensor(list(map(itemgetter("labels"), batch)))))

for classifier recipes, or even:

from torchtune.modules.rlhf.collate import left_padded_collate
from operator import itemgetter
from torch.utils.data import DataLoader
dl = DataLoader(ds, batch_size=4,
                collate_fn=lambda batch: left_padded_collate(map(itemgetter("tokens"), batch))
)

for PPO. I understand if people hate this and we could just define these as functions like normal people.

thoughts appreciated @RdoubleA @ebsmothers @joecummings

RdoubleA · 2024-08-26T19:19:29Z

Having utilities for left pad, right pad, and separate utilities for task specific collating makes sense. But do you need to make the pad_fn configurable for each of these? will users ever do left padding instead of right for SFT for example, and likewise for DPO?

All the collate utils should also just go under torchtune/data

SalmanMohammadi · 2024-08-26T19:24:53Z

🚫🚫 utils 🚫🚫

Yeah no need to make them configurable, you're right.

torchtune/data/_collate.py?

SalmanMohammadi added rfc Request for comments discussion Start a discussion labels Aug 5, 2024

RdoubleA mentioned this issue Aug 27, 2024

Move collate to data #1422

Merged

13 tasks

SalmanMohammadi mentioned this issue Aug 30, 2024

Re-organizing collation utils #1463

Merged

13 tasks

RdoubleA closed this as completed in #1463 Sep 4, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Organize our collation utils #1261

Organize our collation utils #1261

SalmanMohammadi commented Aug 5, 2024

RdoubleA commented Aug 5, 2024

SalmanMohammadi commented Aug 5, 2024 •

edited

Loading

SalmanMohammadi commented Aug 10, 2024 •

edited

Loading

SalmanMohammadi commented Aug 10, 2024 •

edited

Loading

RdoubleA commented Aug 26, 2024 •

edited

Loading

SalmanMohammadi commented Aug 26, 2024 •

edited

Loading

Organize our collation utils #1261

Organize our collation utils #1261

Comments

SalmanMohammadi commented Aug 5, 2024

RdoubleA commented Aug 5, 2024

SalmanMohammadi commented Aug 5, 2024 • edited Loading

SalmanMohammadi commented Aug 10, 2024 • edited Loading

SalmanMohammadi commented Aug 10, 2024 • edited Loading

RdoubleA commented Aug 26, 2024 • edited Loading

SalmanMohammadi commented Aug 26, 2024 • edited Loading

SalmanMohammadi commented Aug 5, 2024 •

edited

Loading

SalmanMohammadi commented Aug 10, 2024 •

edited

Loading

SalmanMohammadi commented Aug 10, 2024 •

edited

Loading

RdoubleA commented Aug 26, 2024 •

edited

Loading

SalmanMohammadi commented Aug 26, 2024 •

edited

Loading