Is your feature request related to a problem? Please describe.
When working with list columns in parquet files, the PyTorch data loader returns them in a specific tuple representation of length 2:
- The concatenated values of the lists from all samples
- The offsets, which indicate where to split the values back into the lists corresponding to each sample
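For illustration, a minimal sketch of this representation (the concrete tensors are hypothetical):

import torch

# Batch of three list samples: [[1, 2], [3], [4, 5, 6]]
values = torch.LongTensor([1, 2, 3, 4, 5, 6])   # concatenated values of all lists
offsets = torch.LongTensor([0, 2, 3])           # start position of each sample's list
# Sample i spans values[offsets[i]:offsets[i+1]]; the last sample runs to len(values)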
That representation is useful if the feature is categorical and the values will be fed directly to a torch.nn.EmbeddingBag() to look up the corresponding embeddings, as shown here.
But if the list columns are not categorical (e.g. floats), or if you don't want to use EmbeddingBag for your categoricals, then you need to manually reconstruct the list columns into a PyTorch sparse tensor.
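For instance, the (values, offsets) pair can be consumed directly by EmbeddingBag, as in this minimal sketch (the cardinality and embedding size are made-up numbers):

import torch

# Batch of three list samples: [[1, 2], [3], [4, 5, 6]]
values = torch.LongTensor([1, 2, 3, 4, 5, 6])
offsets = torch.LongTensor([0, 2, 3])

emb_bag = torch.nn.EmbeddingBag(num_embeddings=100, embedding_dim=8, mode='mean')
pooled = emb_bag(values, offsets)  # shape: (3, 8), one pooled embedding per sample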
Describe the solution you'd like
It would be better if the NVTabular data loader provided both the offset representation (for EmbeddingBag) and also a sparse tensor representation for general usage.
I have implemented a data loader extension that converts the current offset representation of list columns to a sparse or dense tensor, so that it is easier to use in PyTorch pipelines.
from nvtabular.loader.torch import TorchAsyncItr as NVTDataLoader
from nvtabular import Dataset as NVTDataset
import torch  # needed for the tensor operations below (missing from the original snippet)


class NVTDataLoaderWrapper(NVTDataLoader):
    def __init__(self, *args, **kwargs):
        self.default_seq_features_len = None
        if 'default_seq_features_len' in kwargs:
            self.default_seq_features_len = kwargs.pop('default_seq_features_len')
        else:
            raise ValueError('NVTabular data loader requires the "default_seq_features_len" argument '
                             'to create the sparse tensors for list columns')
        super(NVTDataLoaderWrapper, self).__init__(*args, **kwargs)

    def __enter__(self):
        return None

    def __exit__(self, type, value, traceback):
        return None

    def __next__(self):
        cat_features, cont_features, label_features = super(NVTDataLoaderWrapper, self).__next__()

        cat_sequence_features_transf = {}
        if cat_features is not None:
            cat_single_features, cat_sequence_features = cat_features
            cat_sequence_features_transf = {
                fname: self.get_sparse_tensor_list_column(cat_sequence_features[fname], 'categorical')
                for fname in cat_sequence_features
            }

        cont_sequence_features_transf = {}
        if cont_features is not None:
            cont_single_features, cont_sequence_features = cont_features
            cont_sequence_features_transf = {
                fname: self.get_sparse_tensor_list_column(cont_sequence_features[fname], 'continuous')
                for fname in cont_sequence_features
            }

        inputs = {**cat_sequence_features_transf, **cont_sequence_features_transf}
        return inputs

    def get_sparse_tensor_list_column(self, values_offset, feature_group):
        values = values_offset[0].flatten()
        offsets = values_offset[1].flatten()
        num_rows = len(offsets)

        # Appending the values length to the end of the offset vector, to be able to compute
        # the length of the last sequence
        offsets = torch.cat([offsets, torch.LongTensor([len(values)]).to(offsets.device)])
        # Computing the difference between consecutive offsets, to get the sequence lengths
        diff_offsets = offsets[1:] - offsets[:-1]
        # Inferring the number of cols based on the maximum sequence length
        max_seq_len = int(diff_offsets.max())

        if max_seq_len > self.default_seq_features_len:
            raise ValueError('The default sequence length has been configured to {}, but the '
                             'largest sequence in this batch has length {}'.format(
                                 self.default_seq_features_len, max_seq_len))

        # Building the indices to reconstruct the sparse tensors
        row_ids = torch.arange(len(offsets) - 1).to(offsets.device)
        row_ids_repeated = torch.repeat_interleave(row_ids, diff_offsets)
        row_offset_repeated = torch.repeat_interleave(offsets[:-1], diff_offsets)
        col_ids = torch.arange(len(row_offset_repeated)).to(offsets.device) - row_offset_repeated
        indices = torch.cat([row_ids_repeated.unsqueeze(-1), col_ids.unsqueeze(-1)], axis=1)

        if feature_group == 'categorical':
            sparse_tensor_class = torch.sparse.LongTensor
        elif feature_group == 'continuous':
            sparse_tensor_class = torch.sparse.FloatTensor
        else:
            raise NotImplementedError('Invalid feature group from NVTabular: {}'.format(feature_group))

        sparse_tensor = sparse_tensor_class(indices.T, values,
                                            torch.Size([num_rows, self.default_seq_features_len]))
        return sparse_tensor
The data loader will return a dict whose keys are feature names and whose values are dense tensors (with lists padded with 0 up to the maximum configured length). It could also return the intermediate sparse tensor representation, for pipelines that can use it.
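An example using this extended Data Loader, as a sketch (the parquet path, column names, and batch settings below are made up):

import torch
from nvtabular import Dataset as NVTDataset

dataset = NVTDataset('train.parquet', engine='parquet')
loader = NVTDataLoaderWrapper(dataset,
                              cats=['item_id_list'],        # hypothetical categorical list column
                              conts=['price_list'],         # hypothetical continuous list column
                              labels=[],
                              batch_size=128,
                              default_seq_features_len=20)  # pad list features to 20 positions

for batch in loader:
    # Each value is a torch sparse tensor; densify it for layers that need dense input
    item_ids = batch['item_id_list'].to_dense()  # shape: (batch_size, 20)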
P.S. This class currently does not return "simple" (non-list) columns, because there is currently no way to know the column names of the "simple" features (see #499). As soon as that is fixed, this class could also include the "simple" columns and corresponding tensors in the returned dict.