Contributor

@comphead commented Sep 7, 2025

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

This PR adds a new print_schema_tree() method to DFSchema that formats schema information in a tree-like structure similar to Apache Spark's schema display format, with proper nested indentation for complex data types.

Core Implementation
File: datafusion/common/src/dfschema.rs

  • Added print_schema_tree() method
    Public method that formats the entire schema in tree structure
    Handles both qualified and unqualified field names
    Returns formatted string with "root" header and proper indentation

  • Added format_field_with_indent() helper function
    Recursive function that handles nested indentation for complex types
    Supports proper tree structure with |-- and | indentation
    Handles all Arrow data types

Example output for an array of maps:

root
 |-- array_map_field: list (nullable = false)
 |    |-- item: map (nullable = false)
 |    |    |-- key: string (nullable = false)
 |    |    |-- value: string (nullable = false)
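
The recursive indentation described above can be sketched without any Arrow dependency. This is only an illustration, not the code merged in this PR: `Ty` and `Field` are hypothetical stand-ins for Arrow's `DataType` and `Field`, and the function bodies are assumptions that merely reproduce the example output shown above.

```rust
// Hypothetical stand-in for Arrow's DataType (not the real arrow-rs enum).
enum Ty {
    Utf8,
    List(Box<Field>),
    Map { key: Box<Field>, value: Box<Field> },
}

// Hypothetical stand-in for an Arrow field.
struct Field {
    name: String,
    ty: Ty,
    nullable: bool,
}

impl Field {
    fn new(name: &str, ty: Ty, nullable: bool) -> Self {
        Field { name: name.to_string(), ty, nullable }
    }
}

// Short type label used in the tree output.
fn type_name(ty: &Ty) -> &'static str {
    match ty {
        Ty::Utf8 => "string",
        Ty::List(_) => "list",
        Ty::Map { .. } => "map",
    }
}

// Appends one line for `field`, then recurses into its children with a deeper indent.
fn format_field_with_indent(out: &mut String, field: &Field, indent: &str) {
    out.push_str(&format!(
        "\n{}|-- {}: {} (nullable = {})",
        indent,
        field.name,
        type_name(&field.ty),
        field.nullable
    ));
    let child_indent = format!("{}|    ", indent);
    match &field.ty {
        Ty::Utf8 => {}
        Ty::List(item) => format_field_with_indent(out, item, &child_indent),
        Ty::Map { key, value } => {
            format_field_with_indent(out, key, &child_indent);
            format_field_with_indent(out, value, &child_indent);
        }
    }
}

// Formats all top-level fields under a "root" header.
fn print_schema_tree(fields: &[Field]) -> String {
    let mut out = String::from("root");
    for field in fields {
        format_field_with_indent(&mut out, field, " ");
    }
    out
}

// Builds the array-of-maps schema from the example above.
fn example_schema() -> Vec<Field> {
    let map = Ty::Map {
        key: Box::new(Field::new("key", Ty::Utf8, false)),
        value: Box::new(Field::new("value", Ty::Utf8, false)),
    };
    let item = Field::new("item", map, false);
    vec![Field::new("array_map_field", Ty::List(Box::new(item)), false)]
}

fn main() {
    println!("{}", print_schema_tree(&example_schema()));
}
```

Each nesting level appends `|    ` to the indent, which is what produces the repeated vertical bars in the nested lines.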

Are these changes tested?

Are there any user-facing changes?

@github-actions bot added the sql (SQL Planner) and common (Related to common crate) labels Sep 7, 2025
- .map(|c| self.ident_normalizer.normalize(c))
  .enumerate()
  .map(|(i, c)| {
+     let c = self.ident_normalizer.normalize(c);
Contributor Author

This is an unrelated change, just removing an unnecessary loop.

Contributor

It seems to me like this change just moves the call to normalize the identifier into the map function (I don't think there are any loops removed 🤔 )

Contributor Author

It was two map calls before. I hope rustc is smart enough to merge them in the compiled code, but in this case it is slightly more readable as well.

Contributor

I think the change is fine, I just wanted to make sure I understood what was going on. Thank you @comphead
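
For readers following this thread, here is a small sketch of why both forms behave the same. Rust iterator adapters are lazy, so two chained `map` calls still process each element in a single pass. The `normalize` function is a hypothetical stand-in for `self.ident_normalizer.normalize`, and the log exists only to make the evaluation order visible.

```rust
use std::cell::RefCell;

// Hypothetical stand-in for `self.ident_normalizer.normalize`.
fn normalize(ident: &str) -> String {
    ident.to_lowercase()
}

// Shape before the change: normalize in a separate `map`, then enumerate.
fn two_maps(cols: &[&str], log: &RefCell<Vec<String>>) -> Vec<String> {
    cols.iter()
        .map(|c| {
            log.borrow_mut().push(format!("normalize {}", c));
            normalize(c)
        })
        .enumerate()
        .map(|(i, c)| {
            log.borrow_mut().push(format!("label {}", i));
            format!("{}:{}", i, c)
        })
        .collect()
}

// Shape after the change: normalize inside the single `map` closure.
fn one_map(cols: &[&str]) -> Vec<String> {
    cols.iter()
        .enumerate()
        .map(|(i, c)| {
            let c = normalize(c);
            format!("{}:{}", i, c)
        })
        .collect()
}

fn main() {
    let log = RefCell::new(Vec::new());
    let before = two_maps(&["A", "B"], &log);
    let after = one_map(&["A", "B"]);
    assert_eq!(before, after);
    // The two closures run interleaved per element, not as two separate loops.
    println!("{:?}", log.borrow());
}
```

The log shows each element being normalized and then labeled before the next element is touched, which matches the reviewer's reading that no loop is removed by the change.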

@comphead changed the title from "feat: Implement DFSchema.print_schema() method" to "feat: Implement DFSchema.print_schema_tree() method" Sep 7, 2025
Contributor

@alamb left a comment

Thank you @comphead -- I have some suggestions, but nothing that I think would prevent this PR from merging

BTW I was wondering if this would be a better default Display for DFSchema, but it seems like there is already a default implementation

.unwrap();

let output = schema.print_schema_tree();
let expected = "root\n |-- id: int32 (nullable = false)\n |-- name: string (nullable = true)\n |-- age: int64 (nullable = true)\n |-- active: boolean (nullable = false)";
Contributor

Using insta might make these cases easier to update and easier to read. Something like this, perhaps:

insta::assert_snapshot!(batches_to_string(&actual), @r###"
+---------------------------------------+
| sum(arrow_cast(t.time,Utf8("Int64"))) |
+---------------------------------------+
| 19000                                 |
+---------------------------------------+
"###);

)
.unwrap();

let output = schema.print_schema_tree();
Contributor

I think an insta snapshot would be better here.
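
To illustrate the readability point without depending on insta, here is a std-only approximation that writes the expected tree as a multi-line raw string instead of a single line with `\n` escapes. It is not insta's API. `sample_output` is a hypothetical stand-in for the string returned by `schema.print_schema_tree()`.

```rust
// Hypothetical stand-in for the string returned by `schema.print_schema_tree()`.
fn sample_output() -> String {
    [
        "root",
        " |-- id: int32 (nullable = false)",
        " |-- name: string (nullable = true)",
    ]
    .join("\n")
}

// Expected tree as a raw multi-line literal; `trim` drops the
// leading and trailing newlines that surround the literal body.
fn expected_tree() -> &'static str {
    r#"
root
 |-- id: int32 (nullable = false)
 |-- name: string (nullable = true)
"#
    .trim()
}

fn main() {
    assert_eq!(sample_output(), expected_tree());
    println!("{}", expected_tree());
}
```

An inline insta snapshot has a similar shape, and in addition lets the expected text be regenerated when the output format changes.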

Contributor Author

@comphead commented Sep 8, 2025

> BTW I was wondering if this would be a better default Display for DFSchema, but it seems like there is already a default implementation

Display currently provides more information, such as metadata and dict_ordering. I'm not sure this is the right time to replace it, but in the future, why not.

To display the tree schema, we need to ship one more DDL function that shows the schema in the CLI or through DataFusion SQL.
The alternatives for which DDL to choose are listed in #17466; I'm planning to hold a vote on this.

comphead and others added 4 commits September 8, 2025 10:38
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
@alamb merged commit fcd820e into apache:main Sep 9, 2025
28 checks passed

Labels

common (Related to common crate), sql (SQL Planner)

Development

Successfully merging this pull request may close these issues.

Create schema print out method

2 participants