-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Use insta for DataFrame
tests
#15165
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
actual.trim().to_string() | ||
} | ||
|
||
pub fn batches_to_sort_string(batches: &[RecordBatch]) -> String { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there's alternative way: use https://docs.rs/insta/latest/insta/struct.Settings.html#method.sort_maps
this, however, will require switching to serializable format - which we dont want - so leaving as is
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think using strings for comparison works well - let's try this
@@ -429,16 +434,16 @@ async fn drop_with_quotes() -> Result<()> { | |||
|
|||
let df_results = df.collect().await?; | |||
|
|||
assert_batches_sorted_eq!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(here and everywhere else) batches_to_sort_string
order has changed changed because rhs is order-sensitive (since it's a string). there's a way to overcome this (https://github.com/apache/datafusion/pull/15165/files#r1990156623) but IMO as-is is better
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something @crepererum did in influxdb_iox was to create a new that wraps the results so the tests look like this:
Note sure if that is 100% relevant but it is a neat pattern
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't we first sort the batches and then convert to strings? so the orders are preserved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we do sort them before converting to strings. Or maybe I missed your point?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are sorting strings, not rows of the batches?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, true. that can be better - but i think the order is preserved even now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realized now assert_batches_sorted_eq()
does also the same. I'd just like to avoid the strange ordering like
+------+
| f"c2 |
+------+
| 11 |
| 2 |
+------+
even if we say batches_to_sort_string
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that is very very fair! initially i wanted to sort by all columns present in the df - but the issue is that not all types are sortable
I also considered sorting by something like https://docs.rs/human-sort/latest/human_sort/ - but then again, it's +1 extra dev dependency...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @blaginin -- I think this looks great! Though I think we should leave this open for a while to gather more feedback before we merge it
For anyone following along, to update these results what you do is run cargo insta review
@blaginin maybe we can write up / make a screen cast showing the update process?
FYI @jayzhan211 @berkaysynnada @Omega359 . What do you think?
actual.trim().to_string() | ||
} | ||
|
||
pub fn batches_to_sort_string(batches: &[RecordBatch]) -> String { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think using strings for comparison works well - let's try this
@@ -429,16 +434,16 @@ async fn drop_with_quotes() -> Result<()> { | |||
|
|||
let df_results = df.collect().await?; | |||
|
|||
assert_batches_sorted_eq!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something @crepererum did in influxdb_iox was to create a new that wraps the results so the tests look like this:
Note sure if that is 100% relevant but it is a neat pattern
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests seem much better now, thank you @blaginin. I left 2 suggestions.
BTW, do you have any script etc. to make auto-conversions? or we have to update all others by hand?
|
||
// Get string representation of the plan | ||
async fn assert_physical_plan(df: &DataFrame, expected: Vec<&str>) { | ||
async fn physical_plan_to_string(df: &DataFrame) -> String { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This name sounds like (Arc<dyn ExecutionPlan>) -> (String)
. Perhaps we can make this a method of Dataframe, and call it something like "to_physical_plan_string". WDYT?
@@ -429,16 +434,16 @@ async fn drop_with_quotes() -> Result<()> { | |||
|
|||
let df_results = df.collect().await?; | |||
|
|||
assert_batches_sorted_eq!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't we first sort the batches and then convert to strings? so the orders are preserved
Created an epic for subsequent changes, wrote my approach and automation ideas there |
I think we also need to update tests in |
Great, I saw #15178 |
I think this is a definite improvement! |
I updated this PR to say it closes this ticket. Let me know if that isn't right |
I merged this branch up from main to rerun the tests. Once CI passes I'll plan to merge this in |
🚀 |
Thanks again @blaginin Do you plan to organize some more migration as part of It will likely be a good exercise to learn how to get projects tee'd up for the community. I recommend making one or two tickets focused on a particular set of tests and see if others can follow the pattern. Once it is clear they can, then we can apply the copy/paste method to create a bunch of good first issue tickets for the other files that need porting This is going to be so good |
Thank you!! Yes, absolutely, i’m going to create a few good first issues now |
Which issue does this PR close?
insta
for tests) #15178Rationale for this change
Switch to insta snapshots so review and update will be easier
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?
No