Skip to content

Conversation

zhuqi-lucas
Copy link
Contributor

@zhuqi-lucas zhuqi-lucas commented Mar 1, 2025

Which issue does this PR close?

Initial PR:

Rationale for this change

  1. Revert the reverted PR Revert Datafusion-cli: Redesign the datafusion-cli execution and print, make it totally streaming printing without memory overhead #14948
  2. Fix multi-lines printing issue for datafusion-cli

What changes are included in this PR?

  1. Revert the reverted PR
  2. Fix multi-lines printing issue for datafusion-cli

Are these changes tested?

Yes

Are there any user-facing changes?

Fix the explain format

@zhuqi-lucas zhuqi-lucas marked this pull request as draft March 1, 2025 05:31
@zhuqi-lucas
Copy link
Contributor Author

zhuqi-lucas commented Mar 1, 2025

Sorry for the inconvenience!

We can review and merge this PR after the release: cc @alamb @xudong963

  1. Revert the reverted PR
  2. Fix multi-lines printing issue for datafusion-cli

Meanwhile, i think more forks can also test it, i am not sure if we will have more bugs come since it's a totally new way for datafusion-cli, and i have added more testing cases, thanks!

@zhuqi-lucas
Copy link
Contributor Author

Updated testing result:

DataFusion CLI v45.0.0
> create table foo(x int, y int) as values (1,2), (3,4);
0 row(s) fetched.
Elapsed 0.026 seconds.

> explain select * from foo where x = 4;
+---------------+-------------------------------------------------------+
| plan_type     | plan                                                  |
+---------------+-------------------------------------------------------+
| logical_plan  | Filter: foo.x = Int32(4)                              |
|               |   TableScan: foo projection=[x, y]                    |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192           |
|               |   FilterExec: x@0 = 4                                 |
|               |     DataSourceExec: partitions=1, partition_sizes=[1] |
+---------------+-------------------------------------------------------+
2 row(s) fetched.
Elapsed 0.022 seconds.

> explain verbose select * from foo where x = 4;
+------------------------------------------------------------+-------------------------------------------+
| plan_type                                                  | plan                                      |
+------------------------------------------------------------+-------------------------------------------+
| initial_logical_plan                                       | Projection: *                             |
|                                                            |   Filter: foo.x = Int64(4)                |
|                                                            |     TableScan: foo                        |
| logical_plan after inline_table_scan                       | SAME TEXT AS ABOVE                        |
| logical_plan after expand_wildcard_rule                    | Projection: foo.x, foo.y                  |
|                                                            |   Filter: foo.x = Int64(4)                |
|                                                            |     TableScan: foo                        |
| logical_plan after resolve_grouping_function               | SAME TEXT AS ABOVE                        |
| logical_plan after type_coercion                           | Projection: foo.x, foo.y                  |
|                                                            |   Filter: CAST(foo.x AS Int64) = Int64(4) |
|                                                            |     TableScan: foo                        |
| analyzed_logical_plan                                      | SAME TEXT AS ABOVE                        |
| logical_plan after eliminate_nested_union                  | SAME TEXT AS ABOVE                        |
| logical_plan after simplify_expressions                    | SAME TEXT AS ABOVE                        |
| logical_plan after unwrap_cast_in_comparison               | Projection: foo.x, foo.y                  |
|                                                            |   Filter: foo.x = Int32(4)                |
|                                                            |     TableScan: foo                        |
| logical_plan after replace_distinct_aggregate              | SAME TEXT AS ABOVE                        |
| logical_plan after eliminate_join                          | SAME TEXT AS ABOVE                        |
| logical_plan after decorrelate_predicate_subquery          | SAME TEXT AS ABOVE                        |
| logical_plan after scalar_subquery_to_join                 | SAME TEXT AS ABOVE                        |
| logical_plan after extract_equijoin_predicate              | SAME TEXT AS ABOVE                        |
| logical_plan after eliminate_duplicated_expr               | SAME TEXT AS ABOVE                        |
| logical_plan after eliminate_filter                        | SAME TEXT AS ABOVE                        |
| logical_plan after eliminate_cross_join                    | SAME TEXT AS ABOVE                        |
| logical_plan after common_sub_expression_eliminate         | SAME TEXT AS ABOVE                        |
| logical_plan after eliminate_limit                         | SAME TEXT AS ABOVE                        |
| logical_plan after propagate_empty_relation                | SAME TEXT AS ABOVE                        |
| logical_plan after eliminate_one_union                     | SAME TEXT AS ABOVE                        |
| logical_plan after filter_null_join_keys                   | SAME TEXT AS ABOVE                        |
| logical_plan after eliminate_outer_join                    | SAME TEXT AS ABOVE                        |
| logical_plan after push_down_limit                         | SAME TEXT AS ABOVE                        |
| logical_plan after push_down_filter                        | SAME TEXT AS ABOVE                        |
| logical_plan after single_distinct_aggregation_to_group_by | SAME TEXT AS ABOVE                        |
| logical_plan after simplify_expressions                    | SAME TEXT AS ABOVE                        |
| logical_plan after unwrap_cast_in_comparison               | SAME TEXT AS ABOVE                        |
| logical_plan after common_sub_expression_eliminate         | SAME TEXT AS ABOVE                        |
| logical_plan after eliminate_group_by_constant             | SAME TEXT AS ABOVE                        |
| logical_plan after optimize_projections                    | Filter: foo.x = Int32(4)                  |
|                                                            |   TableScan: foo projection=[x, y]        |
| logical_plan after eliminate_nested_union                  | SAME TEXT AS ABOVE                        |
| logical_plan after simplify_expressions                    | SAME TEXT AS ABOVE                        |
| logical_plan after unwrap_cast_in_comparison               | SAME TEXT AS ABOVE                        |
| logical_plan after replace_distinct_aggregate              | SAME TEXT AS ABOVE                        |
| logical_plan after eliminate_join                          | SAME TEXT AS ABOVE                        |
| logical_plan after decorrelate_predicate_subquery          | SAME TEXT AS ABOVE                        |
| logical_plan after scalar_subquery_to_join                 | SAME TEXT AS ABOVE                        |
| logical_plan after extract_equijoin_predicate              | SAME TEXT AS ABOVE                        |
| logical_plan after eliminate_duplicated_expr               | SAME TEXT AS ABOVE                        |
| .                                                          | .                                         |
| .                                                          | .                                         |
| .                                                          | .                                         |
+------------------------------------------------------------+-------------------------------------------+
78 row(s) fetched. (First 40 displayed. Use --maxrows to adjust)
Elapsed 0.005 seconds.

@zhuqi-lucas zhuqi-lucas marked this pull request as ready for review March 1, 2025 11:52
@alamb
Copy link
Contributor

alamb commented Mar 1, 2025

Thanks @zhuqi-lucas -- I'll try and check this out soon (likely tomorrow)

}

#[allow(clippy::too_many_arguments)]
pub fn process_batch(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that we now have some more time, I wonder if you could potentially consider refactoring this code now.

I found it somewhat hard to understand when I was trying to debug why the explain plans were not showing up reasonably.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @alamb for this suggestion, may after this PR finalizing. We can open the #14961 to my branch? Is it a common way for us to collaborate?

Ok(())
}

pub async fn print_table_batch(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add some comments here, as well as how it is different than print_batches

let stdout = std::io::stdout();
let mut writer = stdout.lock();

self.format
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing I noticed when reviewing this code is that there are different passes for print_batches and print_table_batch

I wonder if there is some way to combine them (so we always use the same printing logic)

@zhuqi-lucas
Copy link
Contributor Author

Thanks @zhuqi-lucas -- I'll try and check this out soon (likely tomorrow)

Thank you @alamb for review and good suggestion! I will try to address it today.

@zhuqi-lucas
Copy link
Contributor Author

zhuqi-lucas commented Mar 4, 2025

I am continuing polish the code besides the #14886 which will add the streaming state struct. I believe the code is more readable now.

///
/// The query_start_time is used to calculate the elapsed time for the query.
/// The schema is used to print the header.
pub async fn print_stream(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unified to one API for all cases.

["--command", "show datafusion.execution.batch_size", "--format", "json", "-q", "-b", "1"],
"[{\"name\":\"datafusion.execution.batch_size\",\"value\":\"1\"}]\n"
)]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add rich testing cases in cli_integration, it can avoid regression for datafusion-cli changes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you should be able to use snapshots now (merge latest main to see the changes in that file). lmk if you need help

@zhuqi-lucas zhuqi-lucas changed the title Bug: Fix multi-lines printing issue for datafusion-cli feat: Fix multi-lines printing issue for datafusion-cli and add the streaming printing feature back Mar 14, 2025
Copy link

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale PR has not had any activity for some time label May 18, 2025
@github-actions github-actions bot closed this Jun 14, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Stale PR has not had any activity for some time

Projects

None yet

3 participants