Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

plots: fix plots nested configuration #10320

Merged
merged 11 commits into from
Feb 29, 2024
Merged

plots: fix plots nested configuration #10320

merged 11 commits into from
Feb 29, 2024

Conversation

AlexandreKempf
Copy link
Contributor

When I was working on system metrics on DVClive I did some tests:

I had some .tsv files organized this way:

plots/
    metrics/
        system/
            cpu.tsv
            ram.tsv
        epoch.tsv
        loss.tsv

I also wanted the plots for epochs.tsv and loss.tsv to have step number as the x-axis and cpu.tsv and ram.tsv to have timestamp as the x-axis.
So I did this in the dvc.yaml:

plots:
- dvclive/plots/metrics/system:
    x: timestamp
- dvclive/plots/metrics:
    x: step

It felt like a natural extension of the documentation, but I realized this didn't work.

Two things were going on:
1 - a mutable reference to a dict was messing with the configuration of the plot (fix in this PR)
2 - we collect plots in the order mentioned in dvc.yaml. In that case, the x-axis for the folder dvclive/plots/metrics overrides the values set by a subdirectory. To avoid that, I now sort the paths from the shortest to the longest.

It works wonderfully on my demo project, but I need to add some unit tests.

Thank you for the contribution - we'll try to review it as soon as possible. 🙏

Copy link

codecov bot commented Feb 26, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 90.59%. Comparing base (5be23cf) to head (dc88972).

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #10320      +/-   ##
==========================================
+ Coverage   90.58%   90.59%   +0.01%     
==========================================
  Files         499      500       +1     
  Lines       38547    38589      +42     
  Branches     5567     5572       +5     
==========================================
+ Hits        34916    34958      +42     
  Misses       2986     2986              
  Partials      645      645              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@mattseddon mattseddon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a test

from dvc.repo.plots import Plots


def test_plots_definitions_works_with_nested_plots(tmp_dir, scm, dvc):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good.

[C] A few things from this test for me:

  1. breadcrumbs would be helpful, i.e. naming the variables to reflect what we are trying to test here. My understanding is that we are verifying that a plot definition from a sub-directory is not being overridden by its parent.
  2. This would be a good test to have in tests/integration/plots/test_plots.py and ensure that nothing untoward happens to the data later in the system. For example, does step get correctly assigned to x when no x is present?
  3. If I remember correctly the plot_id can be a string like "acc vs loss". I don't think that information breaks the solution but I would like to see that tested as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Q] Does this work for y as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review Matt :)
I tried to take your comments into account, let me know if we understood each others, or if I get you wrong.

Yes it works for y as well, I added a test condition to show you that ;)

# sort by key length to process the most general paths first
# then override them with more specific ones.
for plot_id, plot_props in sorted(
definitions.items(), key=lambda item: len(item[0])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[C] Whilst this is an elegant solution the chances of it coming back to bite us is (IMO) high because we are mutating data inside this dense loop. In my experience explicit is better than implicit and explicitly removing child paths from parent definitions would be a better way to go. I will leave this up to you.

# sort by key length to process the most general paths first
# then override them with more specific ones.

I would change this to sort by key length to process the parent paths first and override definitions with more specific (child) definitions later

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll let you compare the two implementations. While I agree in general explicit >> implicit, in that specific case I needed to introduce a nasty little logic. Let me know what is the more clear between the two options (implicit + comment vs explicit) ;)

Copy link
Member

@mattseddon mattseddon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of things to consider but overall very good. Thanks.

@@ -519,7 +524,7 @@ def unpack_if_dir(fs, path, props: dict[str, str], onerror: Optional[Callable] =

if "data" in unpacked:
for subpath in unpacked["data"]:
result["data"].update({subpath: props})
result["data"].update({subpath: {} | props}) # avoid shared reference
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use .copy() to make your intentions clearer for doing a shallow copy. No need of a comment.

Suggested change
result["data"].update({subpath: {} | props}) # avoid shared reference
result["data"].update({subpath: props.copy()})

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is shallow copy needed at all?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion :)

Yeah we need to copy this props reference. Otherwise I ended up with cases like this:

architecture:

metrics/
    subdir/
        wonderful_plot.tsv
    super_plot.tsv

dvc.yaml

plots:
    metrics:
        x: step
    metrics/subdir:
        x: timestamp

The first call to this function set all the plot configurations for the metrics folder (wonderful and super plots). Which is what we want. The second call, you expect that it only changes the configs for the content of metrics/subdir (in that case super_plot.tsv. It does so 🎉 , but as he mutate the same dictionary reference (props), the wonderful_plot configuration ends up changes to x: timestamp too. 😿

@AlexandreKempf AlexandreKempf marked this pull request as ready for review February 28, 2024 18:17
@dberenbaum dberenbaum merged commit 5f7acd4 into main Feb 29, 2024
21 checks passed
@dberenbaum dberenbaum deleted the fix-plots-nested-config branch February 29, 2024 18:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants