feat: merge_insert update subcolumns #2639

Merged

wjones127 merged 15 commits into lancedb:main from wjones127:feat/merge-insert-subcols

Jul 30, 2024

Contributor

wjones127 commented Jul 24, 2024 (edited)

Closes #2610

- Supports subschemas in `merge_insert` for updates only
  - Inserts and deletes left as TODO
- Field id `-2` is now reserved as a field "tombstone". These tombstones are fields that are no longer in the schema, usually because those fields are now in a different data file.
- Fixed a bug in `Merger` where statistics were reset on each batch.

github-actions bot added enhancement python labels

codecov-commenter commented Jul 26, 2024 (edited)

Codecov Report

Attention: Patch coverage is 87.24832% with 76 lines in your changes missing coverage. Please review.

Project coverage is 79.47%. Comparing base (aa92730) to head (b2627ce).
Report is 74 commits behind head on main.

Files	Patch %	Lines
rust/lance/src/dataset/write/merge_insert.rs	87.31%	20 Missing and 47 partials ⚠️
rust/lance-arrow/src/lib.rs	80.00%	1 Missing and 3 partials ⚠️
rust/lance-datafusion/src/dataframe.rs	0.00%	2 Missing ⚠️
rust/lance-datafusion/src/exec.rs	90.00%	0 Missing and 1 partial ⚠️
rust/lance/src/datafusion/dataframe.rs	95.65%	0 Missing and 1 partial ⚠️
rust/lance/src/dataset/fragment.rs	91.66%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2639      +/-   ##
==========================================
+ Coverage   79.36%   79.47%   +0.11%     
==========================================
  Files         222      223       +1     
  Lines       64588    65600    +1012     
  Branches    64588    65600    +1012     
==========================================
+ Hits        51258    52138     +880     
- Misses      10349    10431      +82     
- Partials     2981     3031      +50

Flag	Coverage Δ
unittests	`79.47% <87.24%> (+0.11%)`	⬆️

wjones127 force-pushed the feat/merge-insert-subcols branch from 4064a72 to 03bdcef

July 28, 2024 01:53

wjones127 added 13 commits

July 29, 2024 10:08

- tests (b1b7909)
- wip (75fa792)
- implement write part (a63dc05)
- Get main test passing (93b9a5a)
- fix lifetime issue (7adc10c)
- add some parallelism (f4eff4b)
- enable time (37b949e)
- fix parallelism and test (33fdb75)
- better testing (3b1bdee)
- Get it working with scalar indices (ade29d7)
- fix unindexed frags (b975568)
- fix python tests (49950e0)
- cleanup (2cbb97f)

wjones127 force-pushed the feat/merge-insert-subcols branch from c75207e to 2cbb97f

July 29, 2024 17:09

wjones127 marked this pull request as ready for review

July 29, 2024 17:13

wjones127 requested review from westonpace and chebbyChefNEQ

July 29, 2024 17:13

westonpace approved these changes

Contributor

westonpace left a comment

This tackles the problem quite nicely, good work!

rust/lance-core/src/utils/tokio.rs Outdated

@@ -35,6 +35,7 @@ lazy_static::lazy_static! {
                       .worker_threads(1)
                       // keep the thread alive "forever"
                       .thread_keep_alive(Duration::from_secs(u64::MAX))
+                      .enable_all()

Contributor

westonpace Jul 30, 2024

Why would the CPU runtime need the I/O subsystem enabled?

Contributor Author

wjones127 Jul 30, 2024

They have a mix of IO and CPU. I'm not sure how to separate them at this level. Should I just use the current runtime instead of the CPU runtime?

rust/lance/src/dataset/write/merge_insert.rs

    
@@ -428,7 +508,7 @@ impl MergeInsertJob {
                          HashJoinExec::try_new(
                              shared_input,
                              target,
-                             vec![(Arc::new(target_key), Arc::new(source_key))],
+                             vec![(Arc::new(source_key), Arc::new(target_key))],

Contributor

westonpace Jul 30, 2024

Ah, good catch. I guess we got away with it in the past since the position of the key field was always equal in the two schemas?

Contributor Author

wjones127 Jul 30, 2024

It seems so.
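
To illustrate why the order of the key pair matters, here is a minimal sketch of an inner hash join over plain integer rows. It is not DataFusion's `HashJoinExec`; the `hash_join` function, the `Row` alias, and the sample schemas are hypothetical. The only thing carried over from the thread is the idea that each pair is (left key, right key), and that swapping them goes unnoticed only when the key sits at the same position in both schemas.

```rust
use std::collections::HashMap;

/// A row is just a list of integer columns in this sketch.
type Row = Vec<i64>;

/// Minimal inner hash join. `on` is a (left column index, right column index)
/// pair, so the first element must always refer to the left input.
fn hash_join(left: &[Row], right: &[Row], on: (usize, usize)) -> Vec<(Row, Row)> {
    let (left_key, right_key) = on;
    // Build side: index the left rows by their key value.
    let mut built: HashMap<i64, Vec<&Row>> = HashMap::new();
    for row in left {
        built.entry(row[left_key]).or_default().push(row);
    }
    // Probe side: look up the key of each right row.
    let mut out = Vec::new();
    for row in right {
        if let Some(matches) = built.get(&row[right_key]) {
            for m in matches {
                out.push(((**m).clone(), row.clone()));
            }
        }
    }
    out
}

fn main() {
    // Hypothetical source (left) schema: [value, id]; the key is column 1.
    let source = vec![vec![100, 1], vec![200, 2]];
    // Hypothetical target (right) schema: [id, value]; the key is column 0.
    let target = vec![vec![1, 10], vec![2, 20], vec![3, 30]];
    let (source_key, target_key) = (1, 0);

    // Correct order: the left key indexes the left input.
    let good = hash_join(&source, &target, (source_key, target_key));
    // Swapped order: each key now indexes the wrong input.
    let bad = hash_join(&source, &target, (target_key, source_key));

    println!("correct order matched {} rows", good.len());
    println!("swapped order matched {} rows", bad.len());
}
```

With a subschema as the source, the key column is no longer guaranteed to sit at the same index in both inputs, which is when a swapped pair starts to produce wrong matches.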

rust/lance/src/dataset/write/merge_insert.rs

+                      });
+                      let mut group_stream = session_ctx
+                          .read_one_shot(source)?
+                          .sort(vec![col(ROW_ADDR).sort(true, true)])?

Contributor

westonpace Jul 30, 2024

Hmm, clever to sort. I wonder if we want to do that in the original merge_join? Although I guess it changes the order of newly added rows.

Ah, I see, you are doing this for correctness and not performance. I was thinking it might speed up the indexed take.
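
A rough sketch of what sorting by row address buys, under two assumptions that the thread does not spell out: that a row address packs the fragment id into the upper 32 bits and the row offset into the lower 32 bits, and that the sorted stream is then consumed one fragment at a time (the name `group_stream` hints at this). The helper names below are hypothetical.

```rust
/// Split a 64-bit row address into (fragment id, row offset).
/// Assumption: fragment id in the upper 32 bits, offset in the lower 32 bits.
fn split_row_addr(addr: u64) -> (u32, u32) {
    ((addr >> 32) as u32, addr as u32)
}

/// Sort the addresses, then cut the sorted run into one group per fragment.
fn group_by_fragment(mut addrs: Vec<u64>) -> Vec<(u32, Vec<u32>)> {
    addrs.sort_unstable();
    let mut groups: Vec<(u32, Vec<u32>)> = Vec::new();
    for addr in addrs {
        let (frag, offset) = split_row_addr(addr);
        // Start a new group whenever the fragment id changes.
        if groups.last().map(|g| g.0) != Some(frag) {
            groups.push((frag, Vec::new()));
        }
        if let Some(group) = groups.last_mut() {
            group.1.push(offset);
        }
    }
    groups
}

fn main() {
    let addr = |frag: u64, offset: u64| (frag << 32) | offset;
    let unsorted = vec![addr(1, 5), addr(0, 7), addr(1, 2), addr(0, 3)];
    for (frag, offsets) in group_by_fragment(unsorted) {
        println!("fragment {}: offsets {:?}", frag, offsets);
    }
}
```

Because the input is sorted, every fragment shows up as exactly one contiguous group, so a consumer that writes one new data file per group never sees the same fragment twice.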

rust/lance/src/dataset/write/merge_insert.rs Outdated

Comment on lines 797 to 799

+                                      // If there are no tasks running, we can bypass the pool limits.
+                                      memory_size = 0;
+                                      break;

Contributor

westonpace Jul 30, 2024

This is for the case where a single reservation is larger than our pool?
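
The question is left unanswered in the thread. One reading of the snippet, sketched below with a hypothetical `bytes_to_reserve` helper, is that a request larger than the remaining budget would otherwise wait forever when no running task can release memory. This is an illustration of that reading only, not the code in `merge_insert.rs`.

```rust
/// Decide how many bytes to charge against the pool before starting a task.
/// Returns `None` if the task should wait for running tasks to release memory.
fn bytes_to_reserve(requested: usize, pool_free: usize, running_tasks: usize) -> Option<usize> {
    if requested <= pool_free {
        // Normal case: the request fits in the remaining budget.
        Some(requested)
    } else if running_tasks == 0 {
        // Nothing is running, so waiting would never free any memory.
        // Bypass the limit by reserving nothing and letting the task proceed.
        Some(0)
    } else {
        // Wait for a running task to finish and return its reservation.
        None
    }
}

fn main() {
    // Fits in the pool: reserve what was asked for.
    println!("{:?}", bytes_to_reserve(64, 128, 3));
    // Too large, but other tasks are running: wait.
    println!("{:?}", bytes_to_reserve(256, 128, 3));
    // Too large and nothing is running: bypass the pool.
    println!("{:?}", bytes_to_reserve(256, 128, 0));
}
```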

rust/lance/src/dataset/write/merge_insert.rs

+                          for data_file in &mut fragment.files.iter_mut().rev().skip(1) {
+                              for field in &mut data_file.fields {
+                                  if updated_fields.contains(field) {
+                                      // Tombstone these fields

Contributor

westonpace Jul 30, 2024

Do we tombstone fields in drop_columns?

Contributor Author

wjones127 Jul 30, 2024

We do not right now. We just remove the field from the schema.

The reason I introduced the tombstone is I wanted to be able to move the column without changing the schema. If we change the schema, that means other concurrent update / insert transactions would have to fail with commit conflict, which we don't want. We can't have duplicate ids in the field list of the files, so we need to change the old field locations to -2 so we can use the same field ids in the new files.
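
The mechanism described above can be sketched roughly as follows. `DataFile` here is a hypothetical, stripped-down stand-in that only tracks field ids, and `add_updated_columns` is an illustrative helper, not a function from the PR. The loop mirrors the `iter_mut().rev().skip(1)` pattern in the diff: every file except the newly appended one has its updated field ids replaced with `-2`.

```rust
use std::collections::HashSet;

/// Reserved field id marking a slot whose column now lives in another data file.
const TOMBSTONE_FIELD_ID: i32 = -2;

/// Hypothetical, simplified data file: only the list of field ids it stores.
#[derive(Debug, Clone, PartialEq)]
struct DataFile {
    fields: Vec<i32>,
}

/// Append a data file holding the rewritten columns, then tombstone those
/// field ids in every older file so each id appears at most once.
fn add_updated_columns(files: &mut Vec<DataFile>, updated_fields: &[i32]) {
    let updated: HashSet<i32> = updated_fields.iter().copied().collect();
    files.push(DataFile { fields: updated_fields.to_vec() });
    // Walk every file except the one just appended (the last one).
    for data_file in files.iter_mut().rev().skip(1) {
        for field in &mut data_file.fields {
            if updated.contains(&*field) {
                // The column moved to the new file; mark the old slot as dead.
                *field = TOMBSTONE_FIELD_ID;
            }
        }
    }
}

fn main() {
    // One fragment with a single data file storing fields 0, 1 and 2.
    let mut files = vec![DataFile { fields: vec![0, 1, 2] }];
    // An update rewrites only field 1 into a new data file.
    add_updated_columns(&mut files, &[1]);
    println!("{:?}", files);
}
```

The schema still lists fields 0, 1 and 2 with the same ids, so in this sketch only the per-fragment file metadata changes.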

wjones127 added 2 commits

July 30, 2024 13:25

- pr feedback (14fa5a1)
- remove IO features from CPU runtime (b2627ce)

wjones127 merged commit 6ebeaa0 into lancedb:main

21 of 22 checks passed

eddyxu pushed a commit that referenced this pull request


          feat: merge_insert update subcolumns (#2639)

180914b

Closes #2610

* Supports subschemas in `merge_insert` for updates only
  * Inserts and deletes left as TODO
* Field id `-2` is now reserved as a field "tombstone". These tombstones
are fields that are no longer in the schema, usually because those
fields are now in a different data file.
* Fixed a bug in `Merger` where statistics were reset on each batch.

wjones127 mentioned this pull request

Support inserting in merge_insert with a subset of columns #2904

Open

Labels

enhancement python