feat: merge_insert update subcolumns #2639

Merged on Jul 30, 2024 (15 commits)

Conversation

wjones127
Contributor

@wjones127 wjones127 commented Jul 24, 2024

Closes #2610

  • Supports subschemas in merge_insert for updates only
    • Inserts and deletes left as TODO
  • Field id -2 is now reserved as a field "tombstone". These tombstones are fields that are no longer in the schema, usually because those fields are now in a different data file.
  • Fixed a bug in Merger where statistics were reset on each batch.
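The update-only, subschema behavior described above can be illustrated with a minimal in-memory sketch. This is not the Lance implementation: the `Row` and `ScoreUpdate` types and the `merge_update_scores` function are hypothetical names invented for illustration. The source carries only the key and one column, matched rows are updated, and unmatched source rows are ignored because inserts are left as a TODO.

```rust
use std::collections::HashMap;

/// Hypothetical target row: a key plus two value columns.
#[derive(Debug, Clone, PartialEq)]
struct Row {
    id: u64,
    name: String,
    score: i64,
}

/// Hypothetical source row with a subschema: the key and the one column to update.
struct ScoreUpdate {
    id: u64,
    score: i64,
}

/// Update-only merge: matched rows take the new `score`, unmatched source rows
/// are ignored (no insert), and `name` is never touched.
/// Returns the number of target rows that were updated.
fn merge_update_scores(target: &mut [Row], source: &[ScoreUpdate]) -> usize {
    let updates: HashMap<u64, i64> = source.iter().map(|u| (u.id, u.score)).collect();
    let mut updated = 0;
    for row in target.iter_mut() {
        if let Some(&score) = updates.get(&row.id) {
            row.score = score;
            updated += 1;
        }
    }
    updated
}
```

Columns absent from the source (`name` here) keep their existing values, which is the point of accepting a subschema.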

github-actions bot added the enhancement and python labels Jul 24, 2024
codecov-commenter commented Jul 26, 2024

Codecov Report

Attention: Patch coverage is 87.24832% with 76 lines in your changes missing coverage. Please review.

Project coverage is 79.47%. Comparing base (aa92730) to head (b2627ce).
Report is 74 commits behind head on main.

| File | Patch % | Lines |
|---|---|---|
| rust/lance/src/dataset/write/merge_insert.rs | 87.31% | 20 Missing and 47 partials |
| rust/lance-arrow/src/lib.rs | 80.00% | 1 Missing and 3 partials |
| rust/lance-datafusion/src/dataframe.rs | 0.00% | 2 Missing |
| rust/lance-datafusion/src/exec.rs | 90.00% | 0 Missing and 1 partial |
| rust/lance/src/datafusion/dataframe.rs | 95.65% | 0 Missing and 1 partial |
| rust/lance/src/dataset/fragment.rs | 91.66% | 0 Missing and 1 partial |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2639      +/-   ##
==========================================
+ Coverage   79.36%   79.47%   +0.11%     
==========================================
  Files         222      223       +1     
  Lines       64588    65600    +1012     
  Branches    64588    65600    +1012     
==========================================
+ Hits        51258    52138     +880     
- Misses      10349    10431      +82     
- Partials     2981     3031      +50     
Flag Coverage Δ
unittests 79.47% <87.24%> (+0.11%) ⬆️


@wjones127 wjones127 marked this pull request as ready for review July 29, 2024 17:13
Contributor

@westonpace westonpace left a comment

This tackles the problem quite nicely, good work!

@@ -35,6 +35,7 @@ lazy_static::lazy_static! {
.worker_threads(1)
// keep the thread alive "forever"
.thread_keep_alive(Duration::from_secs(u64::MAX))
+.enable_all()
Contributor

Why would the CPU runtime need the I/O subsystem enabled?

Contributor Author

They have a mix of IO and CPU. I'm not sure how to separate them at this level. Should I just use the current runtime instead of the CPU runtime?

@@ -428,7 +508,7 @@ impl MergeInsertJob {
HashJoinExec::try_new(
shared_input,
target,
-vec![(Arc::new(target_key), Arc::new(source_key))],
+vec![(Arc::new(source_key), Arc::new(target_key))],
Contributor

Ah, good catch. I guess we got away with it in the past since the position of the key field was always equal in the two schemas?

Contributor Author

It seems so.
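To illustrate why the order of the key pair matters, here is a minimal sketch of a hash join whose `on` argument is `(left_key_index, right_key_index)`. It is a toy stand-in, not DataFusion's `HashJoinExec`: the `hash_join` function and the `Row` alias are hypothetical. With the key at the same position in both schemas, swapping the pair changes nothing, which would explain how the old code got away with it.

```rust
use std::collections::HashMap;

/// Hypothetical row type: a flat list of integer columns.
type Row = Vec<i64>;

/// Minimal inner hash join. `on` is `(left_key_index, right_key_index)`:
/// the first index is resolved against `left`, the second against `right`.
fn hash_join(left: &[Row], right: &[Row], on: (usize, usize)) -> Vec<(Row, Row)> {
    let (left_key, right_key) = on;
    // Build side: index the right rows by their key value.
    let mut index: HashMap<i64, Vec<&Row>> = HashMap::new();
    for row in right {
        index.entry(row[right_key]).or_default().push(row);
    }
    // Probe side: look up each left row's key in the index.
    let mut out = Vec::new();
    for l in left {
        if let Some(matches) = index.get(&l[left_key]) {
            for &r in matches {
                out.push((l.clone(), r.clone()));
            }
        }
    }
    out
}
```

If the source keeps its key in column 0 and the target keeps it in column 1, then `(0, 1)` joins on the key, while the swapped `(1, 0)` silently compares unrelated columns.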

});
let mut group_stream = session_ctx
.read_one_shot(source)?
.sort(vec![col(ROW_ADDR).sort(true, true)])?
Contributor

Hmm, clever to sort. I wonder if we want to do that in the original merge_join? Although I guess it changes the order of newly added rows.

Ah, I see, you are doing this for correctness and not performance. I was thinking it might speed up the indexed take.
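One plausible reading of the correctness point is that sorting by row address makes all rows of a fragment arrive contiguously, so they can be grouped in a single pass. The sketch below shows that idea only. It assumes a row address packs the fragment id in the upper 32 bits and the row offset in the lower 32 bits, and the function names `split_row_addr` and `group_by_fragment` are hypothetical.

```rust
/// Split a row address into (fragment id, row offset).
/// Assumption: upper 32 bits are the fragment id, lower 32 bits the offset.
fn split_row_addr(addr: u64) -> (u32, u32) {
    ((addr >> 32) as u32, addr as u32)
}

/// Sort the row addresses, then emit one contiguous group of offsets per fragment.
fn group_by_fragment(mut addrs: Vec<u64>) -> Vec<(u32, Vec<u32>)> {
    addrs.sort_unstable();
    let mut groups: Vec<(u32, Vec<u32>)> = Vec::new();
    for addr in addrs {
        let (frag, offset) = split_row_addr(addr);
        // Start a new group whenever the fragment id changes.
        if groups.last().map_or(true, |(f, _)| *f != frag) {
            groups.push((frag, Vec::new()));
        }
        if let Some((_, offsets)) = groups.last_mut() {
            offsets.push(offset);
        }
    }
    groups
}
```

Without the sort, the same fragment could appear in several separate groups, and a single-pass grouping like this one would split its rows.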

Comment on lines 797 to 799
// If there are no tasks running, we can bypass the pool limits.
memory_size = 0;
break;
Contributor

This is for the case where a single reservation is larger than our pool?
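If the reviewer's reading is right, the bypass prevents a deadlock when one reservation exceeds the whole pool. The sketch below illustrates that idea with a hypothetical `Pool` type. It is not the code under review, and only the "charge zero when nothing is running" step mirrors the quoted snippet.

```rust
/// Hypothetical bounded pool tracking reserved bytes and running tasks.
struct Pool {
    capacity: usize,
    reserved: usize,
    running_tasks: usize,
}

impl Pool {
    fn new(capacity: usize) -> Self {
        Pool { capacity, reserved: 0, running_tasks: 0 }
    }

    /// Try to admit a task that needs `size` bytes.
    /// Returns the bytes charged to the pool, or `None` if the task must wait.
    fn try_admit(&mut self, size: usize) -> Option<usize> {
        let charged = if self.reserved + size <= self.capacity {
            size
        } else if self.running_tasks == 0 {
            // Nothing is running, so waiting would never free memory:
            // bypass the pool limit and charge nothing.
            0
        } else {
            return None;
        };
        self.reserved += charged;
        self.running_tasks += 1;
        Some(charged)
    }

    /// Release a finished task, returning the bytes it was charged.
    fn finish(&mut self, charged: usize) {
        self.reserved -= charged;
        self.running_tasks -= 1;
    }
}
```

An oversized request is admitted only while the pool is idle. Once any task is running, the same request waits like every other.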

for data_file in &mut fragment.files.iter_mut().rev().skip(1) {
for field in &mut data_file.fields {
if updated_fields.contains(field) {
// Tombstone these fields
Contributor

Do we tombstone fields in drop_columns?

Contributor Author

We do not right now. We just remove the field from the schema.

The reason I introduced the tombstone is that I wanted to be able to move the column without changing the schema. If we change the schema, other concurrent update / insert transactions would have to fail with a commit conflict, which we don't want. We can't have duplicate ids in the field list of the files, so we need to change the old field locations to -2 so we can reuse the same field ids in the new files.
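The tombstoning step can be sketched in isolation as follows. The `DataFile` struct here is a simplified, hypothetical stand-in holding only a list of field ids, and `tombstone_updated_fields` is an illustrative name. Only the `-2` value and the "every file except the newest" iteration come from the discussion above.

```rust
/// Field id reserved as a tombstone, per the PR description.
const TOMBSTONE_FIELD_ID: i32 = -2;

/// Hypothetical, simplified data file: just the field ids it stores.
#[derive(Debug, PartialEq)]
struct DataFile {
    fields: Vec<i32>,
}

/// Tombstone `updated_fields` in every file except the last (newest) one,
/// so each field id appears in at most one file of the fragment.
fn tombstone_updated_fields(files: &mut [DataFile], updated_fields: &[i32]) {
    for data_file in files.iter_mut().rev().skip(1) {
        for field in data_file.fields.iter_mut() {
            if updated_fields.contains(&*field) {
                *field = TOMBSTONE_FIELD_ID;
            }
        }
    }
}
```

The schema is untouched: the field keeps its id, and only the record of which file stores it changes.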

@wjones127 wjones127 merged commit 6ebeaa0 into lancedb:main Jul 30, 2024
21 of 22 checks passed
eddyxu pushed a commit that referenced this pull request Jul 31, 2024

Successfully merging this pull request may close these issues.

Update subset of columns in merge_insert
3 participants