Skip to content

Conversation

blaginin
Copy link
Contributor

@blaginin blaginin commented Mar 11, 2025

Which issue does this PR close?

Rationale for this change

Switch to insta snapshots so review and update will be easier

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

No

@github-actions github-actions bot added the core Core DataFusion crate label Mar 11, 2025
@github-actions github-actions bot added the common Related to common crate label Mar 11, 2025
actual.trim().to_string()
}

pub fn batches_to_sort_string(batches: &[RecordBatch]) -> String {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's alternative way: use https://docs.rs/insta/latest/insta/struct.Settings.html#method.sort_maps

this, however, will require switching to serializable format - which we dont want - so leaving as is

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think using strings for comparison works well - let's try this

@@ -429,16 +434,16 @@ async fn drop_with_quotes() -> Result<()> {

let df_results = df.collect().await?;

assert_batches_sorted_eq!(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(here and everywhere else) batches_to_sort_string order has changed changed because rhs is order-sensitive (since it's a string). there's a way to overcome this (https://github.com/apache/datafusion/pull/15165/files#r1990156623) but IMO as-is is better

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we first sort the batches and then convert to strings? so the orders are preserved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we do sort them before converting to strings. Or maybe I missed your point?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are sorting strings, not rows of the batches?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, true. that can be better - but i think the order is preserved even now

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realized now assert_batches_sorted_eq() does also the same. I'd just like to avoid the strange ordering like

  +------+
  | f"c2 |
  +------+
  | 11   |
  | 2    |
  +------+

even if we say batches_to_sort_string

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that is very very fair! initially i wanted to sort by all columns present in the df - but the issue is that not all types are sortable

I also considered sorting by something like https://docs.rs/human-sort/latest/human_sort/ - but then again, it's +1 extra dev dependency...

@blaginin blaginin marked this pull request as ready for review March 11, 2025 21:35
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @blaginin -- I think this looks great! Though I think we should leave this open for a while to gather more feedback before we merge it

For anyone following along, to update these results what you do is run cargo insta review

200w

@blaginin maybe we can write up / make a screen cast showing the update process?

FYI @jayzhan211 @berkaysynnada @Omega359 . What do you think?

actual.trim().to_string()
}

pub fn batches_to_sort_string(batches: &[RecordBatch]) -> String {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think using strings for comparison works well - let's try this

@@ -429,16 +434,16 @@ async fn drop_with_quotes() -> Result<()> {

let df_results = df.collect().await?;

assert_batches_sorted_eq!(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb alamb changed the title Add insta for df testing Use insta for DataFrame tests Mar 11, 2025
@jayzhan211
Copy link
Contributor

jayzhan211 commented Mar 12, 2025

Great @blaginin! I think I can revisit #10363 again which requires MANUALLY updating a lot of Rust tests

Copy link
Contributor

@berkaysynnada berkaysynnada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests seem much better now, thank you @blaginin. I left 2 suggestions.

BTW, do you have any script etc. to make auto-conversions? or we have to update all others by hand?


// Get string representation of the plan
async fn assert_physical_plan(df: &DataFrame, expected: Vec<&str>) {
async fn physical_plan_to_string(df: &DataFrame) -> String {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This name sounds like (Arc<dyn ExecutionPlan>) -> (String). Perhaps we can make this a method of Dataframe, and call it something like "to_physical_plan_string". WDYT?

@@ -429,16 +434,16 @@ async fn drop_with_quotes() -> Result<()> {

let df_results = df.collect().await?;

assert_batches_sorted_eq!(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we first sort the batches and then convert to strings? so the orders are preserved

@berkaysynnada
Copy link
Contributor

The failure display is also quite nice:
image

@blaginin
Copy link
Contributor Author

blaginin commented Mar 12, 2025

Created an epic for subsequent changes, wrote my approach and automation ideas there

@jayzhan211
Copy link
Contributor

Screenshot 2025-03-12 at 8 31 37 PM

There is even a button to update the test! I guess it is from rust-analyzer 🤔

@jayzhan211
Copy link
Contributor

jayzhan211 commented Mar 12, 2025

I think we also need to update tests in datafusion/core/tests/dataframe/dataframe_functions.rs, maybe in another PR?

@jayzhan211
Copy link
Contributor

Great, I saw #15178

@Omega359
Copy link
Contributor

I think this is a definite improvement!

@alamb
Copy link
Contributor

alamb commented Mar 12, 2025

I updated this PR to say it closes this ticket. Let me know if that isn't right

@alamb
Copy link
Contributor

alamb commented Mar 14, 2025

I merged this branch up from main to rerun the tests. Once CI passes I'll plan to merge this in

@alamb alamb merged commit a3980db into apache:main Mar 14, 2025
29 checks passed
@alamb
Copy link
Contributor

alamb commented Mar 14, 2025

🚀

@alamb
Copy link
Contributor

alamb commented Mar 14, 2025

Thanks again @blaginin

Do you plan to organize some more migration as part of

It will likely be a good exercise to learn how to get projects tee'd up for the community. I recommend making one or two tickets focused on a particular set of tests and see if others can follow the pattern.

Once it is clear they can, then we can apply the copy/paste method to create a bunch of good first issue tickets for the other files that need porting

This is going to be so good

@blaginin
Copy link
Contributor Author

Thank you!! Yes, absolutely, i’m going to create a few good first issues now

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

common Related to common crate core Core DataFusion crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Auto-update mechanism for dataframe test

6 participants