refactor: load table state lazily #1361

Open
wjones127 opened this issue May 12, 2023 · 0 comments
Labels
binding/python Issues for the Python package enhancement New feature or request

Comments

@wjones127
Collaborator

Right now, when we instantiate a Delta table, we load the entire table state into memory. Many workloads don't need all of it, especially those that only query a few partitions at a time. Instead, we should skip the add and remove actions when instantiating, and load them only as needed during scans.

When we read the log files, we should cache them on disk so we can quickly scan them again later for add and remove actions.
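A minimal sketch of what such a disk cache could look like. `LogFileCache` and its `fetch` callback are hypothetical names for illustration, not existing delta-rs APIs. It relies on two things: commit files are named by zero-padded 20-digit version, and a committed log file never changes, so a cached copy never needs invalidating.

```python
from pathlib import Path
from typing import Callable


class LogFileCache:
    """Hypothetical on-disk cache for Delta commit files (not a delta-rs API)."""

    def __init__(self, cache_dir: Path, fetch: Callable[[int], bytes]):
        self.cache_dir = cache_dir
        # `fetch` is an assumed callback that downloads one commit file
        # from the table's object store, given its version number.
        self.fetch = fetch
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def path_for(self, version: int) -> Path:
        # Commit files are named with a zero-padded 20-digit version.
        return self.cache_dir / f"{version:020d}.json"

    def read(self, version: int) -> bytes:
        path = self.path_for(version)
        if path.exists():
            # Cache hit: no round trip to object storage.
            return path.read_bytes()
        data = self.fetch(version)
        # Write to a temp file, then rename, so a concurrent reader
        # never observes a partially written commit file.
        tmp = path.with_suffix(".tmp")
        tmp.write_bytes(data)
        tmp.replace(path)
        return data
```

The second `read` of the same version is served from disk, which is what makes re-scanning for add and remove actions cheap.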

Instantiate table

  1. Verify _delta_log exists
  2. Either identify most recent version, or verify version requested exists
  3. Download relevant log files
  4. Scan for the table-level actions, such as metadata and protocol.

This gives you an instance of DeltaTable that you can inspect to get table-level metadata. When asked for files, it will run the scan process below.
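Step 4 could be sketched roughly as follows. This assumes the commit files are already downloaded as newline-delimited JSON strings and ignores checkpoints entirely; `load_table_level_actions` is a hypothetical helper, not a delta-rs function. It walks commits from newest to oldest, keeps the first `protocol` and `metaData` actions it sees, and stops as soon as it has both, without materializing any add or remove actions.

```python
import json
from typing import Iterable, Optional, Tuple


def load_table_level_actions(
    commits_newest_first: Iterable[str],
) -> Tuple[Optional[dict], Optional[dict]]:
    """Return the latest (protocol, metaData) actions, skipping everything else."""
    protocol: Optional[dict] = None
    metadata: Optional[dict] = None
    for commit in commits_newest_first:
        for line in commit.splitlines():
            if not line.strip():
                continue
            # Each log line is a JSON object with a single key naming the action.
            action = json.loads(line)
            if protocol is None and "protocol" in action:
                protocol = action["protocol"]
            elif metadata is None and "metaData" in action:
                metadata = action["metaData"]
            # add / remove / commitInfo / txn lines are parsed but not retained.
        if protocol is not None and metadata is not None:
            # Older commits cannot override what we already found.
            break
    return protocol, metadata
```

A real implementation would likely avoid fully parsing add and remove lines at all, but the shape of the early exit is the same.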

Scan table

  1. Using the partition filter, scan the log files to get the add and remove actions that are relevant.
  2. Use the actions to generate the file set
  3. Prune the files based on stats
  4. Scan the remaining data files.

This process will need to run any time the table is asked for its files.
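Steps 1 to 3 could be sketched as below, under some simplifying assumptions: commits are newline-delimited JSON strings replayed oldest to newest, the partition filter is a plain equality map, and the stats predicate is a single `column >= lower_bound` check. `active_files` and `prune_by_min` are hypothetical names, not delta-rs APIs.

```python
import json
from typing import Dict, Iterable, List


def active_files(
    commits_oldest_first: Iterable[str],
    partition_filter: Dict[str, str],
) -> Dict[str, dict]:
    """Replay add/remove actions, keeping only adds in matching partitions."""
    files: Dict[str, dict] = {}
    for commit in commits_oldest_first:
        for line in commit.splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                add = action["add"]
                values = add.get("partitionValues", {})
                if all(values.get(k) == v for k, v in partition_filter.items()):
                    files[add["path"]] = add
            elif "remove" in action:
                # Removing a path we never kept is a no-op.
                files.pop(action["remove"]["path"], None)
    return files


def prune_by_min(files: Dict[str, dict], column: str, lower_bound) -> List[str]:
    """Drop files whose recorded max for `column` is below `lower_bound`."""
    kept = []
    for path, add in files.items():
        # `stats` is stored as a JSON string and may be missing.
        stats = json.loads(add["stats"]) if add.get("stats") else {}
        max_value = stats.get("maxValues", {}).get(column)
        if max_value is not None and max_value < lower_bound:
            continue  # no row in this file can satisfy column >= lower_bound
        kept.append(path)  # files without usable stats must be kept
    return kept
```

Only the paths returned by `prune_by_min` would then be handed to the data-file scan in step 4.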

In-memory caching

Below a certain table-state size (which we could make configurable), it's probably fine to keep all the table state in memory. So we might find some data structure that we can use as a size-bounded in-memory cache for table state. 🤔
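One candidate shape for that data structure is an LRU cache with a byte budget rather than an entry count. The sketch below is only an illustration: `ByteBudgetCache` is a hypothetical name, the caller is assumed to supply a size estimate for each entry, and what the keys are (a commit version, a partition, etc.) is left open.

```python
from collections import OrderedDict
from typing import Any, Hashable, Optional, Tuple


class ByteBudgetCache:
    """Hypothetical LRU cache bounded by total estimated bytes."""

    def __init__(self, max_bytes: int):
        self.max_bytes = max_bytes  # the configurable threshold
        self.used_bytes = 0
        # Maps key -> (value, size); insertion order doubles as recency order.
        self._entries: "OrderedDict[Hashable, Tuple[Any, int]]" = OrderedDict()

    def get(self, key: Hashable) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return entry[0]

    def put(self, key: Hashable, value: Any, size: int) -> None:
        if key in self._entries:
            # Replacing an entry: release its old size first.
            self.used_bytes -= self._entries.pop(key)[1]
        if size > self.max_bytes:
            return  # larger than the whole budget, so never cached
        self._entries[key] = (value, size)
        self.used_bytes += size
        # Evict least recently used entries until we fit the budget again.
        while self.used_bytes > self.max_bytes:
            _, (_, evicted_size) = self._entries.popitem(last=False)
            self.used_bytes -= evicted_size
```

With a budget larger than the table state nothing is ever evicted, which matches the "small tables stay fully in memory" case. Larger tables degrade to re-reading evicted state from the log.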
