Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: upgrade DataFusion, Arrow, PyO3, ObjectStore #2594

Merged
merged 9 commits into from
Jul 16, 2024

Conversation

wjones127
Copy link
Contributor

@wjones127 wjones127 commented Jul 12, 2024

BREAKING CHANGE:

  • Removed access to MultipartId, cleanup_partial_writes. Uploads are now automatically cleaned up.
    • FragmentWriteProgress no longer has any multipart_id argument.
  • CommitHandler and ManifestWriter APIs now take a Lance ObjectStore rather than object_store::ObjectStore.

New features:

  • For S3 and Azure (these features were already implemented for GCS):
    • Max object size is now 2.5 TB, up from 50GB.
    • Retries are now performed for Connection reset network errors.
    • Reduced number of IOPS for writing small objects (<5MB)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of this was ported from the now deleted gcs_wrapper.rs

Comment on lines +277 to 290
impl Drop for ObjectWriter {
fn drop(&mut self) {
// If there is a multipart upload started but not finished, we should abort it.
if matches!(self.state, UploadState::InProgress { .. }) {
// Take ownership of the state.
let state = std::mem::replace(&mut self.state, UploadState::Done);
if let UploadState::InProgress { mut upload, .. } = state {
tokio::task::spawn(async move {
let _ = upload.abort().await;
});
}
}
}
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what ensures partial uploads are cleaned up now. MultipartId is no longer exposed in the object-store API.

@codecov-commenter
Copy link

codecov-commenter commented Jul 13, 2024

Codecov Report

Attention: Patch coverage is 53.81679% with 242 lines in your changes missing coverage. Please review.

Project coverage is 80.22%. Comparing base (8cab512) to head (895d2ee).
Report is 1 commits behind head on main.

Files Patch % Lines
rust/lance-io/src/object_writer.rs 67.93% 71 Missing and 13 partials ⚠️
rust/lance-datagen/src/generator.rs 5.12% 37 Missing ⚠️
rust/lance-index/src/scalar/btree.rs 4.00% 23 Missing and 1 partial ⚠️
rust/lance/src/io/exec/planner.rs 34.48% 18 Missing and 1 partial ⚠️
rust/lance/src/io/exec/scalar_index.rs 15.38% 11 Missing ⚠️
rust/lance/src/io/exec/knn.rs 43.75% 9 Missing ⚠️
rust/lance-index/src/scalar/expression.rs 0.00% 6 Missing ⚠️
rust/lance-io/src/object_store.rs 72.22% 3 Missing and 2 partials ⚠️
rust/lance/src/io/exec/utils.rs 0.00% 5 Missing ⚠️
rust/lance-datafusion/src/exec.rs 0.00% 4 Missing ⚠️
... and 15 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2594      +/-   ##
==========================================
+ Coverage   79.97%   80.22%   +0.24%     
==========================================
  Files         212      211       -1     
  Lines       61556    61418     -138     
  Branches    61556    61418     -138     
==========================================
+ Hits        49228    49270      +42     
+ Misses       9402     9217     -185     
- Partials     2926     2931       +5     
Flag Coverage Δ
unittests 80.22% <53.81%> (+0.24%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@wjones127 wjones127 marked this pull request as ready for review July 13, 2024 04:25
Cargo.toml Outdated Show resolved Hide resolved
python/python/lance/cleanup.py Show resolved Hide resolved
// Native types, which don't support Distribution.
// IntervalUnit::DayTime => rand::<IntervalDayTimeType>(),
// IntervalUnit::MonthDayNano => rand::<IntervalMonthDayNanoType>(),
IntervalUnit::DayTime | IntervalUnit::MonthDayNano => todo!(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@wjones127 wjones127 requested a review from BubbleCal July 15, 2024 20:22
Copy link
Contributor

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks quite exhausting. Thanks for the cleanup

Comment on lines -1666 to +1720
DataType::Interval(unit) => match unit {
IntervalUnit::DayTime => rand::<IntervalDayTimeType>(),
IntervalUnit::MonthDayNano => rand::<IntervalMonthDayNanoType>(),
IntervalUnit::YearMonth => rand::<IntervalYearMonthType>(),
},
DataType::Interval(unit) => rand_interval(*unit),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the change here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think they changed the underlying native type to a struct rather than i128/i64. Thus Distribution was no longer implemented.

apache/arrow-rs#5769

Comment on lines -406 to -410
pub async fn cleanup_partial_writes(
store: &ObjectStore,
base_path: &Path,
objects: impl IntoIterator<Item = (&Path, &String)>,
) -> Result<()> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why get rid of this? Was it no longer working? Or no longer needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

object_store no longer exposes the multipart ids. So now users can't collect them and pass them into this function.

Instead, we automatically cleanup on Drop for ObjectWriter. In case of crashes, users should rely on setting a lifecycle rule to delete old incomplete writes. This is already recommended in our docs.

@wjones127 wjones127 merged commit 30af1d8 into lancedb:main Jul 16, 2024
22 checks passed
@wjones127 wjones127 deleted the feat/upgrade-datafusion branch July 16, 2024 20:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants