Handling large tables #336
Unanswered
February24-Lee asked this question in Q&A
Replies: 1 comment · 3 replies
-
I think one trick is to merge all the text columns and produce only one embedding from the merged text. These are just my personal thoughts; any comments are welcome.
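For example, something along these lines with pandas; the column names here are only placeholders:

```python
import pandas as pd

# Placeholder column names -- substitute the real text columns.
text_cols = ["title", "description", "summary", "notes"]

df = pd.read_csv("data.csv")

# Join the text columns into a single string per row, so only one
# embedding (instead of four) has to be computed and stored per row.
df["merged_text"] = df[text_cols].fillna("").agg(" [SEP] ".join, axis=1)
```

That cuts the stored embedding memory for the text columns by roughly a factor of four, at the cost of losing the per-column separation.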
-
Hello, is there an efficient method or future development idea for handling large tables?
In my case, the table size is 1,442,792 x 12, with 2 categorical, 4 numerical, and 4 embedded-text columns. I cached all of the text with the OpenAI embedding API, so each text embedding has a dimension of almost 1,500. The problem is that this consumes too much memory and kills the process early (in fact, I found this was my own mistake). It was inconvenient when running repeated short experiments.
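A rough back-of-envelope estimate of why it blows up, assuming 1536-dimensional float32 embeddings (the exact dimension depends on the OpenAI model):

```python
# Dense float32 embeddings for all four text columns, fully materialized.
rows, n_text_cols, dim, bytes_per_float = 1_442_792, 4, 1536, 4
total_bytes = rows * n_text_cols * dim * bytes_per_float
print(f"{total_bytes / 1e9:.1f} GB")  # ~35.5 GB, before any temporary copies
```

Intermediate copies made during conversion can push the peak well above that.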
So I modified the dataset and loader to execute convert_to_tensor_frame only when they are called. My code is here: https://github.com/February24-Lee/pytorch-frame/pull/1/files
I thought this approach was the simplest and required the fewest modifications, though it is not fancy.
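Roughly the idea, sketched with a plain torch.utils.data.Dataset; LazyFrameDataset and to_tensor_frame are placeholder names, and the actual change in the PR hooks into pytorch-frame's own Dataset and DataLoader instead:

```python
from torch.utils.data import DataLoader, Dataset


class LazyFrameDataset(Dataset):
    """Keeps the raw pandas DataFrame and converts only the requested rows
    to a TensorFrame when a batch is fetched, instead of materializing all
    ~1.4M rows up front."""

    def __init__(self, df, to_tensor_frame):
        # `to_tensor_frame` is assumed to be a callable that turns a
        # DataFrame slice into a TensorFrame (the conversion that would
        # otherwise be applied to the whole table at once).
        self.df = df
        self.to_tensor_frame = to_tensor_frame

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        return idx  # defer all work to the collate function

    def collate(self, indices):
        # Convert just this batch's rows on the fly.
        return self.to_tensor_frame(self.df.iloc[list(indices)])


# Usage sketch (df and to_tensor_frame are placeholders):
# dataset = LazyFrameDataset(df, to_tensor_frame)
# loader = DataLoader(dataset, batch_size=512, shuffle=True,
#                     collate_fn=dataset.collate)
```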
Anyway, I'm curious about your thoughts or future plans regarding handling large tables like mine.
Thank you.