feat: merge_insert update subcolumns #2639
Conversation
Codecov Report
@@ Coverage Diff @@
## main #2639 +/- ##
==========================================
+ Coverage 79.36% 79.47% +0.11%
==========================================
Files 222 223 +1
Lines 64588 65600 +1012
Branches 64588 65600 +1012
==========================================
+ Hits 51258 52138 +880
- Misses 10349 10431 +82
- Partials 2981 3031 +50
Force-pushed from 4064a72 to 03bdcef
Force-pushed from c75207e to 2cbb97f
This tackles the problem quite nicely, good work!
rust/lance-core/src/utils/tokio.rs (Outdated)
@@ -35,6 +35,7 @@ lazy_static::lazy_static! {
         .worker_threads(1)
         // keep the thread alive "forever"
         .thread_keep_alive(Duration::from_secs(u64::MAX))
+        .enable_all()
Why would the CPU runtime need the I/O subsystem enabled?
They have a mix of IO and CPU. I'm not sure how to separate them at this level. Should I just use the current runtime instead of the CPU runtime?
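For context, the runtime under discussion looks roughly like the sketch below. This is a simplified reconstruction based on the diff above, not the exact lance-core code; `enable_all()` is what registers the I/O (and timer) drivers the question is about.

```rust
use std::time::Duration;
use tokio::runtime::{Builder, Runtime};

lazy_static::lazy_static! {
    // Dedicated runtime for CPU-bound work (illustrative sketch only).
    static ref CPU_RUNTIME: Runtime = Builder::new_multi_thread()
        .worker_threads(1)
        // keep the thread alive "forever"
        .thread_keep_alive(Duration::from_secs(u64::MAX))
        // enable_all() also turns on the I/O and timer drivers, which is
        // what lets mixed I/O + CPU tasks run on this runtime.
        .enable_all()
        .build()
        .expect("failed to build CPU runtime");
}
```

One alternative, if the I/O portions can be isolated, would be to keep this runtime CPU-only and hand the I/O-bound futures to the ambient runtime via `tokio::runtime::Handle::current()`, but the thread above leaves that as an open design question.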
@@ -428,7 +508,7 @@ impl MergeInsertJob {
         HashJoinExec::try_new(
             shared_input,
             target,
-            vec![(Arc::new(target_key), Arc::new(source_key))],
+            vec![(Arc::new(source_key), Arc::new(target_key))],
Ah, good catch. I guess we got away with it in the past since the position of the key field was always equal in the two schemas?
It seems so.
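To make the "got away with it" point concrete, here is a small standalone sketch (plain Rust, not Lance or DataFusion code) of a hash join driven by a (left key index, right key index) pair. Passing the pair in the wrong order only stays correct while the key column happens to sit at the same position in both schemas:

```rust
use std::collections::HashMap;

/// Join two "tables" (rows of i64 columns) on a single key pair given as
/// (left column index, right column index).
fn hash_join(
    left: &[Vec<i64>],
    right: &[Vec<i64>],
    on: (usize, usize),
) -> Vec<(Vec<i64>, Vec<i64>)> {
    // Build a hash table on the right input, keyed by its join column.
    let mut build: HashMap<i64, Vec<&Vec<i64>>> = HashMap::new();
    for row in right {
        build.entry(row[on.1]).or_default().push(row);
    }
    // Probe with the left input's join column.
    let mut out = Vec::new();
    for row in left {
        if let Some(matches) = build.get(&row[on.0]) {
            for m in matches {
                out.push((row.clone(), (*m).clone()));
            }
        }
    }
    out
}

fn main() {
    // The key column is at index 0 in the source but index 1 in the target.
    let source = vec![vec![1, 100], vec![2, 200]];
    let target = vec![vec![10, 1], vec![20, 2]];

    // Correct ordering: (source key index, target key index).
    assert_eq!(hash_join(&source, &target, (0, 1)).len(), 2);
    // Swapped ordering probes and builds on the wrong columns: no matches.
    assert_eq!(hash_join(&source, &target, (1, 0)).len(), 0);
}
```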
});
let mut group_stream = session_ctx
    .read_one_shot(source)?
    .sort(vec![col(ROW_ADDR).sort(true, true)])?
Hmm, clever to sort. I wonder if we want to do that in the original merge_join? Although I guess it changes the order of newly added rows.
Ah, I see, you are doing this for correctness and not performance. I was thinking it might speed up the indexed take.
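For readers following along, here is a rough illustration of why sorting by row address is about correctness: it makes the update stream contiguous per fragment so each fragment can be grouped and rewritten once. This is a toy sketch, not Lance code, and it assumes the usual Lance convention that a row address packs the fragment id in the upper 32 bits:

```rust
/// Illustrative helper: fragment id lives in the upper 32 bits of a row address.
fn fragment_id(row_addr: u64) -> u32 {
    (row_addr >> 32) as u32
}

fn main() {
    // Updated rows can come out of the join in arbitrary order.
    let mut row_addrs: Vec<u64> = vec![
        (2u64 << 32) | 7,
        (0u64 << 32) | 3,
        (2u64 << 32) | 1,
        (0u64 << 32) | 9,
    ];

    // After sorting, all rows belonging to a fragment are contiguous and in
    // offset order, so a single pass can emit one group per fragment.
    row_addrs.sort_unstable();

    let mut groups: Vec<(u32, Vec<u64>)> = Vec::new();
    for addr in row_addrs {
        let frag = fragment_id(addr);
        if let Some(last) = groups.last_mut() {
            if last.0 == frag {
                last.1.push(addr);
                continue;
            }
        }
        groups.push((frag, vec![addr]));
    }
    assert_eq!(groups.len(), 2); // one group per touched fragment
}
```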
// If there are no tasks running, we can bypass the pool limits.
memory_size = 0;
break;
This is for the case where a single reservation is larger than our pool?
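If it helps to see the intent spelled out, here is a simplified, hypothetical sketch of that kind of reservation loop (toy types, not the actual pool code): when a single request exceeds the whole pool but nothing else is running, waiting would never succeed, so the request is admitted without being charged against the limit.

```rust
/// Toy memory pool: `capacity` bytes shared by the tasks currently running.
struct ToyPool {
    capacity: u64,
    in_use: u64,
    running_tasks: usize,
}

impl ToyPool {
    /// Try to admit a reservation of `memory_size` bytes, returning the
    /// number of bytes actually charged to the pool.
    fn try_reserve(&mut self, mut memory_size: u64) -> Option<u64> {
        loop {
            if self.in_use + memory_size <= self.capacity {
                break;
            }
            if self.running_tasks == 0 {
                // If there are no tasks running, we can bypass the pool limits;
                // otherwise an oversized reservation could never proceed.
                memory_size = 0;
                break;
            }
            // The real code would wait here for other tasks to release memory;
            // the toy version simply gives up.
            return None;
        }
        self.in_use += memory_size;
        self.running_tasks += 1;
        Some(memory_size)
    }
}

fn main() {
    let mut pool = ToyPool { capacity: 100, in_use: 0, running_tasks: 0 };
    // A reservation larger than the whole pool is still admitted (charged 0)
    // because nothing else is running.
    assert_eq!(pool.try_reserve(250), Some(0));
}
```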
for data_file in &mut fragment.files.iter_mut().rev().skip(1) {
    for field in &mut data_file.fields {
        if updated_fields.contains(field) {
            // Tombstone these fields
Do we tombstone fields in drop_columns?
We do not right now. We just remove the field from the schema.
The reason I introduced the tombstone is that I wanted to be able to move the column without changing the schema. If we change the schema, other concurrent update / insert transactions would have to fail with a commit conflict, which we don't want. We can't have duplicate ids in a file's field list, so we need to change the old field locations to -2 so we can use the same field ids in the new files.
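As a concrete illustration of that explanation, here is a simplified sketch (toy structs, not Lance's actual manifest types) of the tombstoning step: every data file except the newest one has its updated field ids rewritten to -2, so the new file can carry those ids without the fragment ever listing a field id twice.

```rust
use std::collections::HashSet;

/// Reserved field id marking a column that has moved to another data file.
const TOMBSTONE_FIELD_ID: i32 = -2;

/// Toy stand-ins for the fragment / data-file metadata.
struct DataFile {
    fields: Vec<i32>,
}

struct Fragment {
    /// Data files, oldest first; the last one is the newly written file
    /// that now holds the updated columns.
    files: Vec<DataFile>,
}

/// Tombstone the updated field ids in every data file except the newest one.
fn tombstone_updated_fields(fragment: &mut Fragment, updated_fields: &HashSet<i32>) {
    for data_file in fragment.files.iter_mut().rev().skip(1) {
        for field in &mut data_file.fields {
            if updated_fields.contains(field) {
                // The column now lives in the newest data file, so the old
                // location is marked with the reserved tombstone id.
                *field = TOMBSTONE_FIELD_ID;
            }
        }
    }
}

fn main() {
    let mut fragment = Fragment {
        files: vec![
            DataFile { fields: vec![0, 1, 2] }, // original data file
            DataFile { fields: vec![1, 2] },    // new file with updated columns
        ],
    };
    let updated: HashSet<i32> = [1, 2].into_iter().collect();
    tombstone_updated_fields(&mut fragment, &updated);
    assert_eq!(fragment.files[0].fields, vec![0, -2, -2]);
    assert_eq!(fragment.files[1].fields, vec![1, 2]);
}
```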
Closes #2610

* Supports subschemas in `merge_insert` for updates only
* Inserts and deletes left as TODO
* Field id `-2` is now reserved as a field "tombstone". These tombstones are fields that are no longer in the schema, usually because those fields are now in a different data file.
* Fixed a bug in `Merger` where statistics were reset on each batch.