shoyer
Member

@shoyer shoyer commented Oct 5, 2025

This PR makes a number of improvements and revisions to Xarray's HTML reprs, especially for DataTree:

  1. No line breaks in long headers like "Data variables" and "Inherited Coordinates"
  2. Add ~4px of extra padding at the end of HTML reprs, to make pages like Xarray's docs look a little better
  3. Remove 2px shift on headers when actively clicked on. (I think this was intentional, but it seems to result in weird layout glitches because the :active selector doesn't always go away when focus is moved elsewhere)
  4. Remove the collapsible "Groups" header from DataTree. Instead, each group is separately collapsible, and shows the total number of contained elements.
  5. Truncation for too many HTML elements is revised. I've added the options display_max_items and display_max_html_elements for controlling at what point the DataTree HTML repr collapses and truncates nodes, instead of doing this all based on display_max_children.

This needs a few more tests and release notes, but is ready for feedback! @jsignell @TomNicholas @benbovy

  • Tests added
  • User visible changes (including notable bug fixes) are documented in whats-new.rst

Code to generate HTML previews:

import xarray as xr
import numpy as np

# Set up coordinates
time = xr.DataArray(data=["2022-01", "2023-01"], dims="time")
stations = xr.DataArray(data=list("abcdef"), dims="station")
lon = [-100, -80, -60]
lat = [10, 20, 30]

# Set up fake data
wind_speed = xr.DataArray(np.ones((2, 6)) * 2, dims=("time", "station"))
pressure = xr.DataArray(np.ones((2, 6)) * 3, dims=("time", "station"))
air_temperature = xr.DataArray(np.ones((2, 6)) * 4, dims=("time", "station"))
dewpoint = xr.DataArray(np.ones((2, 6)) * 5, dims=("time", "station"))
infrared = xr.DataArray(np.ones((2, 3, 3)) * 6, dims=("time", "lon", "lat"))
true_color = xr.DataArray(np.ones((2, 3, 3)) * 7, dims=("time", "lon", "lat"))

dt2 = xr.DataTree.from_dict(
    {
        "/": xr.Dataset(
            coords={"time": time},
        ),
        "/weather": xr.Dataset(
            coords={"station": stations},
            data_vars={
                "wind_speed": wind_speed,
                "pressure": pressure,
            },
        ),
        "/weather/temperature": xr.Dataset(
            data_vars={
                "air_temperature": air_temperature,
                "dewpoint": dewpoint,
            },
        ),
        "/satellite": xr.Dataset(
            coords={"lat": lat, "lon": lon},
            data_vars={
                "infrared": infrared,
                "true_color": true_color,
            },
        ),
    },
)
dt2['/other'] = xr.Dataset({f'x{i}': 0 for i in range(500)})

number_of_files = 20
number_of_groups = 50
tree_dict = {}
for f in range(number_of_files):
    for g in range(number_of_groups):
        tree_dict[f"file_{f}/group_{g}"] = xr.Dataset({"g": f * g})
tree_too_many = xr.DataTree.from_dict(tree_dict)


print("<h1>DataTree root</h1>")
print(dt2._repr_html_())

print("<hr />")
print("<h1>Dataset</h1>")

print(dt2.weather.to_dataset()._repr_html_())

print("<hr />")

print("<h1>DataTree inherited</h1>")
print(dt2.weather._repr_html_())

print("<hr />")
print("<h1>DataTree too many nodes</h1>")
print(tree_too_many._repr_html_())

Revised (this PR)

Interactive preview

image

Baseline

Interactive preview

image

@jsignell
Contributor

jsignell commented Oct 9, 2025

Ok I took a look at this with this kind of evil DataTree from the truncation work:

import numpy as np
import xarray as xr

number_of_files = 700
number_of_groups = 5
number_of_variables = 10

datasets = {}
for f in range(number_of_files):
    for g in range(number_of_groups):
        # Create synthetic (deterministic) data
        time = np.linspace(0, 50 + f, 1 + 1000 * g)
        y = f * time + g

        # Create dataset:
        ds = xr.Dataset(
            data_vars={
                f"temperature_{g}{i}": ("time", y)
                for i in range(number_of_variables // number_of_groups)
            },
            coords={"time": ("time", time)},
        ).chunk()

        # Prepare for xr.DataTree:
        name = f"file_{f}/group_{g}"
        datasets[name] = ds

dt = xr.DataTree.from_dict(datasets)

I really like the space changes and removing the collapsible "Groups" header and having each group be collapsible on its own.

I wasn't quite sure how to interpret the collapsed count for a group that just has one dataset in it. It seems like it is the n coords + n data_vars. Which seems odd. I think there shouldn't be a count on a group that just contains a single dataset.

The group level count when there are child groups should just be the number of groups.

image

I like the idea of having a display_max_html_elements and would be happy for it to be a lot lower than 300 by default, but truncation is still necessary for the case where there just are more than display_max_html_elements at the top level.

For instance you still get 700 top-level nodes in the repr when you do:

with xr.set_options(display_max_html_elements=5):
    display(dt)

I think in general it would be nice to be able to drill down into a particular node within the repr even if there are a bunch of items at a particular level.
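
As a minimal sketch of the kind of top-level truncation meant here: keep the first and last few children and replace the middle with a marker, so both ends of a long level stay reachable. The function name `truncate_children` and the marker text are hypothetical, not part of Xarray, and this runs on plain lists of names rather than real DataTree nodes.

```python
def truncate_children(names, max_children):
    """Keep the first and last names, replacing the middle with a marker.

    Hypothetical helper for illustration only; not an Xarray API.
    """
    if len(names) <= max_children:
        return list(names)
    first = (max_children + 1) // 2   # slots given to the head
    last = max_children - first       # remaining slots go to the tail
    hidden = len(names) - max_children
    tail = list(names[-last:]) if last else []
    return list(names[:first]) + [f"... ({hidden} more)"] + tail


names = [f"file_{f}" for f in range(700)]
print(truncate_children(names, 5))
```

With 700 top-level names and a limit of 5, this keeps three names from the start and two from the end, with a single marker line standing in for the other 695.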

@shoyer
Member Author

shoyer commented Oct 9, 2025

> I wasn't quite sure how to interpret the collapsed count for a group that just has one dataset in it. It seems like it is the n coords + n data_vars. Which seems odd. I think there shouldn't be a count on a group that just contains a single dataset.
>
> The group level count when there are child groups should just be the number of groups.

The strategy I was using is counting the number of hidden items (at any level), with the idea being that it should be obvious if a large amount of data is hidden. Otherwise you could have a collapsed group marked as "(1)" that hides hundreds of data variables, which felt wrong to me.
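
To make the difference between the two counting strategies concrete, here is a rough sketch on plain dictionaries. The node layout (`n_coords`, `n_data_vars`, `children`) and both function names are assumptions made for illustration, not the PR's actual code, and counting each child group as one item is also an assumption.

```python
def make_node(n_coords=0, n_data_vars=0, children=None):
    # Hypothetical stand-in for a DataTree node; not an xarray object.
    return {
        "n_coords": n_coords,
        "n_data_vars": n_data_vars,
        "children": children or {},
    }


def count_hidden_items(node):
    """Count everything a collapsed group would hide, at any depth."""
    total = node["n_coords"] + node["n_data_vars"]
    for child in node["children"].values():
        # Assumption: each child group counts as one item, plus its contents.
        total += 1 + count_hidden_items(child)
    return total


def count_child_groups(node):
    """Alternative from the review: count only the direct child groups."""
    return len(node["children"])


small = make_node(n_coords=1, n_data_vars=2)
big = make_node(n_data_vars=500)
parent = make_node(children={"small": small, "big": big})

print(count_hidden_items(parent))   # large: reflects the 500 hidden variables
print(count_child_groups(parent))   # small: only the two direct groups
```

Under these assumptions the first count exposes that hundreds of variables are hidden, while the second reports only the two direct groups.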

> I like the idea of having a display_max_html_elements and would be happy for it to be a lot lower than 300 by default, but truncation is still necessary for the case where there just are more than display_max_html_elements at the top level.

Do you think this is common? I don't think we do this for the other Xarray HTML reprs. They get collapsed but nodes are not truncated at the top level.

> I think in general it would be nice to be able to drill down into a particular node within the repr even if there are a bunch of items at a particular level.

I am currently displaying DataTree elements in priority order, based on showing the top-most levels as completely as possible (breadth-first). We could start by going deep (depth-first), but this would mean that some high-level nodes could be truncated.
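
The trade-off can be sketched on a toy tree of nested dictionaries. This is only an illustration of the two traversal orders under a fixed node budget; the function names and the `max_nodes` parameter are hypothetical and do not mirror the PR's implementation.

```python
from collections import deque


def breadth_first_paths(tree, max_nodes):
    """Return up to max_nodes node paths, shallowest levels first."""
    shown = []
    queue = deque([("/", tree)])
    while queue and len(shown) < max_nodes:
        path, children = queue.popleft()
        shown.append(path)
        for name, grandchildren in children.items():
            queue.append((path.rstrip("/") + "/" + name, grandchildren))
    return shown


def depth_first_paths(tree, max_nodes):
    """Return up to max_nodes node paths, following each branch to its end."""
    shown = []
    stack = [("/", tree)]
    while stack and len(shown) < max_nodes:
        path, children = stack.pop()
        shown.append(path)
        # Push in reverse so the first child is visited first.
        for name, grandchildren in reversed(list(children.items())):
            stack.append((path.rstrip("/") + "/" + name, grandchildren))
    return shown


# Toy tree: each key is a group name, each value its child groups.
tree = {"weather": {"temperature": {}}, "satellite": {}}

print(breadth_first_paths(tree, 3))
print(depth_first_paths(tree, 3))
```

With a budget of three nodes, the breadth-first order shows both top-level groups and drops the nested one, while the depth-first order shows the nested group and drops the top-level `/satellite`.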

Maybe there's some compromise algorithm that could work better?
