Replies: 4 comments
-
Hi,
Accessing (a few!) rows using a list of indices can be done with this trick:
import vaex
df = vaex.example()
df['index'] = vaex.vrange(0, len(df))  # This is virtual - does not use any memory, just used for tracking here
my_list = [0, 5, 100]
df[df.index.isin(my_list)]
This might be slow if you have a very long list of indices and a huge dataset. I am not familiar with PyTorch, so I can't offer more specific advice. But if you share a code example of what you are trying to achieve, it might be easier to help. Otherwise, you could convert the indices to a column of keywords such as 'train', 'test', and 'val', and then use simple filtering to get the subset of data you want.
Also you might want to look through
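To make the keyword-column suggestion concrete, here is a minimal sketch in plain Python rather than Vaex. The toy values, the index lists, and the helper `rows_in` are all hypothetical stand-ins. The idea is to turn the per-split index lists into one label per row, filter on that label, and compute the standardization parameters from the train rows only. In Vaex the label list would presumably become a column and the filter a boolean expression like the `isin` one above, but that part is not shown here.

```python
from statistics import fmean, pstdev

# Hypothetical toy data standing in for one DataFrame column.
values = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]

# Hypothetical index lists for each split.
split_indices = {"train": [0, 2, 4], "val": [1], "test": [3, 5]}

# Turn the index lists into one label per row.
labels = ["unused"] * len(values)
for name, indices in split_indices.items():
    for i in indices:
        labels[i] = name


def rows_in(split):
    """Return the values whose label matches `split` (a simple filter)."""
    return [v for v, label in zip(values, labels) if label == split]


# Standardization parameters come from the train rows only.
train = rows_in("train")
train_mean = fmean(train)
train_std = pstdev(train)  # population standard deviation

# Apply the train statistics to another split.
standardized_val = [(v - train_mean) / train_std for v in rows_in("val")]
```

For a dataset of the size discussed in this thread, the labels would more realistically live in an array or a DataFrame column than in a Python list; the list is only there to keep the sketch dependency-free.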
-
My dataset will be about 50 million records, roughly 25 GB. Making a column of set names (train, val, test) should be pretty easy. Good idea! Do you know if there are plans to support indexing with a list of indices in the future? Vaex-ml looks pretty cool.
-
I think |
-
We can support indices, but they aren't very memory-friendly. I think we would need strong demand for it, which I don't see right now.
-
Hi!
Is it possible to select rows with a list of indices? Currently, I can select a single row with df[0] and a slice of rows with df[0:5]. But selecting rows 0, 3, and 5 with df[[0,3,5]] doesn't seem to work.
My scenario:
I have a dataset I'm using for ML. I have lists of indices for the train, val, and test sets. I want to compute standardization params (means and standard deviations) on the train set. I also want to use the Vaex DataFrame in a PyTorch Dataset. My current plan is to create a custom PyTorch Dataset where I access the correct DataFrame index in the PyTorch Dataset's __getitem__ method. This can be done one sample at a time using df[index], but I suspect it will be more efficient to grab multiple samples at a time using df[indices] and PyTorch DataLoader's batch_sampler.
Thanks!
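As a rough illustration of the batched access pattern described above, here is a sketch in plain Python with no PyTorch or Vaex dependency. The class `BatchedRows` and the helper `batch_indices` are hypothetical names, and the in-memory list stands in for the DataFrame. The point is only that `__getitem__` can accept a whole list of indices, so one lookup serves one batch.

```python
import random


class BatchedRows:
    """Hypothetical map-style dataset that can return several rows per lookup."""

    def __init__(self, rows):
        self.rows = rows

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, indices):
        # Accept either a single int or a list of ints.
        if isinstance(indices, int):
            return self.rows[indices]
        return [self.rows[i] for i in indices]


def batch_indices(indices, batch_size, shuffle=False, seed=0):
    """Yield lists of indices, batch_size at a time (the last may be shorter)."""
    indices = list(indices)
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield indices[start:start + batch_size]


dataset = BatchedRows([x * 10 for x in range(8)])
train_indices = [0, 3, 5, 6, 7]  # hypothetical train split
batches = [dataset[batch] for batch in batch_indices(train_indices, batch_size=2)]
```

In PyTorch, as far as I know, the closest equivalent is to pass a `BatchSampler` as the DataLoader's `sampler` argument with `batch_size=None`, so that each `__getitem__` call receives a list of indices; this is worth checking against the DataLoader documentation. How the list lookup itself would be done against a Vaex DataFrame is left open here, since the thread indicates that `df[[0,3,5]]` is not supported.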