feat: add lazy loading of tables #3

Open
wants to merge 2 commits into
base: main

PengLiVectra
Collaborator

@PengLiVectra PengLiVectra commented Nov 8, 2023

Description

Add lazy loading of tables. When performing streaming operations, we don't need any version of the table loaded. Large tables are slow to load, so we see a huge performance boost by avoiding the CPU time spent loading the table metadata.

Related Issue(s)

Partly related to delta-io#1361

Testing

Breaking Change

Not a breaking change.

Documentation

@@ -247,7 +248,7 @@ def __init__(
This can decrease latency if there are many files in the log since the last checkpoint,
but will also increase memory usage. Possible rate limits of the storage backend should
also be considered for optimal performance. Defaults to 4 * number of cpus.

:param lazy_load: when true the table metadata isn't loaded
"""
self._storage_options = storage_options
self._table = RawDeltaTable(

@ginevragaudioso ginevragaudioso Nov 8, 2023

The table and metadata initialization already happen below; I think we should remove them from here.

Collaborator Author

Removed, and updated the docstring and functions.

@PengLiVectra PengLiVectra force-pushed the add-lazy-loading branch 2 times, most recently from a1cd31d to e77089f Compare November 9, 2023 12:38
@ginevragaudioso

@dsandesari do you think we should add unit tests for this? On the Rust side and/or the Python side?

@@ -121,6 +121,43 @@ impl RawDeltaTable {
})
}

#[classmethod]
#[pyo3(signature = (table_uri, version = None, storage_options = None, without_files = false))]
fn load_lazy(

What is the difference between load_lazy and new above?

Collaborator Author

load_lazy uses builder.build(), while new uses builder.load(). builder.load() builds the DeltaTable and loads its state; builder.build() only builds the DeltaTable. See details here: https://github.com/delta-io/delta-rs/blob/main/crates/deltalake-core/src/table/builder.rs#L269-L293
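
The build-versus-load distinction can be illustrated with a small, self-contained Python model. This is only a sketch under stated assumptions: `ToyTable` and `ToyTableBuilder` are hypothetical stand-ins rather than the delta-rs API, and the "log" is just an in-memory list. It assumes, per the comment above, that build() constructs the table handle without reading any state, while load() also replays the log.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyTable:
    """Hypothetical stand-in for a table handle; not the delta-rs DeltaTable."""

    uri: str
    log: list = field(default_factory=list)    # pretend transaction log
    version: Optional[int] = None              # None means "state not loaded"
    files: list = field(default_factory=list)  # files known after loading

    def load_state(self) -> None:
        # Replaying the log is the expensive step that lazy loading skips.
        for version, added_file in enumerate(self.log):
            self.files.append(added_file)
            self.version = version


class ToyTableBuilder:
    """Hypothetical builder mirroring the build()/load() split described above."""

    def __init__(self, uri: str, log: list):
        self.uri = uri
        self.log = log

    def build(self) -> ToyTable:
        # Only constructs the handle; no log entries are read.
        return ToyTable(uri=self.uri, log=self.log)

    def load(self) -> ToyTable:
        # Constructs the handle and then loads its state.
        table = self.build()
        table.load_state()
        return table


log = ["part-0.parquet", "part-1.parquet", "part-2.parquet"]
lazy = ToyTableBuilder("memory://t", log).build()
eager = ToyTableBuilder("memory://t", log).load()
```

In this model the lazy handle has no version and no file list until something asks for them, which is the cost that a streaming caller would avoid paying.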

Collaborator Author

I'm not sure whether we need to include all the options in load_lazy.

Collaborator

We can add tests for load_lazy both with and without options to ensure it works as expected.

@syedashrafulla syedashrafulla Nov 30, 2023

@PengLiVectra for links like these line links, make sure you pin the GitHub URL to a tag so the highlighted lines stay consistent. Right now the highlight lands on something else instead of build and load. Looks good otherwise, though. We could consider DRYing this up by pulling out the first part of each function (before we call build vs. load). Up to you.
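
One possible shape for that refactor, sketched in Python rather than Rust to keep it short. Everything here is hypothetical: `FakeBuilder`, `_configure_builder`, `new_table`, and `load_lazy_table` are illustrative names, and only the parameter names (table_uri, version, storage_options, without_files) are taken from the pyo3 signature shown in the diff. The point is simply that the two entry points share all of their setup and differ only in the final call.

```python
from typing import Dict


class FakeBuilder:
    """Hypothetical builder that just records the options applied to it."""

    def __init__(self, table_uri: str):
        self.table_uri = table_uri
        self.options: Dict[str, object] = {}

    def with_option(self, name: str, value: object) -> "FakeBuilder":
        self.options[name] = value
        return self

    def build(self) -> dict:
        # Handle only: state is not loaded.
        return {"uri": self.table_uri, "options": dict(self.options), "loaded": False}

    def load(self) -> dict:
        # Handle plus loaded state.
        table = self.build()
        table["loaded"] = True
        return table


def _configure_builder(table_uri, version=None, storage_options=None, without_files=False):
    """Shared setup that both entry points would otherwise duplicate."""
    builder = FakeBuilder(table_uri)
    if version is not None:
        builder.with_option("version", version)
    if storage_options is not None:
        builder.with_option("storage_options", storage_options)
    if without_files:
        builder.with_option("without_files", True)
    return builder


def new_table(table_uri, **kwargs):
    # Eager path: differs from the lazy path only in the final call.
    return _configure_builder(table_uri, **kwargs).load()


def load_lazy_table(table_uri, **kwargs):
    # Lazy path: same configuration, but stops at build().
    return _configure_builder(table_uri, **kwargs).build()
```

With the setup in one helper, a new option only needs to be wired in once, which also answers the earlier question of whether load_lazy should accept the same options as new.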

Collaborator

We should consider it; most of the code seems repetitive.

@syedashrafulla

How is this work related to github/delta-io/delta-rs/issues/1361?

@ginevragaudioso

Do we know why the tests are not passing?

@PengLiVectra PengLiVectra force-pushed the add-lazy-loading branch 2 times, most recently from 9ca9531 to c717f5d Compare November 28, 2023 15:57
log_buffer_size=log_buffer_size,
)
self._metadata = None
return
Collaborator

We can add a test using lazy_load.
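
A rough idea of what such a test might check, written against stubs so it runs on its own. `CountingRawTable` and `ToyDeltaTable` are hypothetical stand-ins, not the classes in this PR; only the `self._metadata = None` / `return` branch mirrors the hunk above, and the assumption that the eager path reads metadata inside `__init__` is mine. A real test would construct the actual table class against a fixture table instead.

```python
class CountingRawTable:
    """Hypothetical stub standing in for RawDeltaTable; counts metadata reads."""

    metadata_reads = 0

    def __init__(self, table_uri, storage_options=None):
        self.table_uri = table_uri
        self.storage_options = storage_options

    def metadata(self):
        # Record that the (normally expensive) metadata read happened.
        type(self).metadata_reads += 1
        return {"name": "toy", "uri": self.table_uri}


class ToyDeltaTable:
    """Hypothetical wrapper; only the lazy_load branch mirrors the diff above."""

    def __init__(self, table_uri, storage_options=None, lazy_load=False):
        self._storage_options = storage_options
        self._table = CountingRawTable(table_uri, storage_options)
        if lazy_load:
            # Lazy path: skip the metadata read entirely.
            self._metadata = None
            return
        # Assumed eager path: metadata is read during construction.
        self._metadata = self._table.metadata()


def test_lazy_load_skips_metadata():
    CountingRawTable.metadata_reads = 0
    table = ToyDeltaTable("memory://t", lazy_load=True)
    assert table._metadata is None
    assert CountingRawTable.metadata_reads == 0


def test_eager_load_reads_metadata_with_options():
    CountingRawTable.metadata_reads = 0
    table = ToyDeltaTable("memory://t", storage_options={"region": "local"})
    assert table._metadata["uri"] == "memory://t"
    assert CountingRawTable.metadata_reads == 1
```

Covering both the lazy and the eager path, with and without storage options, would also address the earlier request to test load_lazy with and without options.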

@PengLiVectra PengLiVectra force-pushed the add-lazy-loading branch 2 times, most recently from 83478f5 to 6ba7634 Compare December 5, 2023 16:57
4 participants