gh-110019: Refactor summarize_stats #110398

Merged
markshannon merged 8 commits into python:main from summarize_stats-refactor on Oct 24, 2023

mdboom
Contributor

@mdboom mdboom commented Oct 5, 2023

This refactors summarize_stats so that the comparative tables are easier to make and use more common code.

Reviewing this as a diff may be rather difficult -- instead maybe just look at the file verbatim.

Member

@markshannon markshannon left a comment

Thanks for this. This file definitely needed refactoring.

I have a few comments. I think a more functional (as opposed to OO) style would help clarity, but the general design seems sound.

a_ncols = list(set(len(x) for x in a_rows))
if len(a_ncols) != 1:
raise ValueError("Table a is ragged")
class Stats(dict):
Member

Inheriting from builtin collections can be awkward.
Could you wrap the dict?
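
A minimal sketch of the "wrap, don't inherit" suggestion. The method names (`get`, `to_json`) and the example key are made up for illustration and are not taken from the PR; the point is only that the wrapper exposes a small, explicit interface instead of the whole `dict` API.

```python
import json
from typing import Any


class Stats:
    """Hypothetical wrapper: holds a dict rather than subclassing it."""

    def __init__(self, data: dict[str, Any]) -> None:
        # Copy so later changes to the caller's dict do not leak in.
        self._data = dict(data)

    def get(self, key: str, default: Any = 0) -> Any:
        # Only the operations the tool needs are exposed.
        return self._data.get(key, default)

    def to_json(self) -> str:
        # Serialization is an explicit step instead of relying on
        # the object itself being a dict.
        return json.dumps(self._data, sort_keys=True)


# Illustrative key name, not necessarily the real stats format.
stats = Stats({"opcode[LOAD_FAST].execution_count": 10})
print(stats.get("opcode[LOAD_FAST].execution_count"))
print(stats.get("missing"))
```

With this shape, extra values such as the defines can be plain attributes set in `__init__`, at the cost of writing the JSON dump explicitly.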

else:
ncols = b_ncols[0]
elif input.is_dir():
stats: collections.Counter = collections.Counter()
Member

This type annotation seems redundant

Contributor Author

Yep, but mypy requires it :(

Member

Don't use mypy then?

Is there a Mypy issue for this?

Member

This should make mypy happy and Mark less unhappy:

Suggested change
- stats: collections.Counter = collections.Counter()
+ stats = collections.Counter[str]()

Member

@AlexWaygood AlexWaygood Oct 5, 2023

> Don't use mypy then?
>
> Is there a Mypy issue for this?

It's not a mypy bug. Pyright will complain at you in exactly the same way about this. Mypy can't tell what kind of items are going to be stored as keys for the Counter, so it demands an explicit annotation. @mdboom's annotation shuts mypy up, because collections.Counter as a type annotation is equivalent to collections.Counter[Any]. But the better solution is to use collections.Counter[str], because the keys of stats are all strings.

Member

It is a mypy bug.
Omitting the annotation and providing collections.Counter as an annotation have exactly the same information content. So complaining about one and not the other is erroneous.
I assume mypy will not complain about stats = collections.Counter[str]() then.

Member

I agree there's a usability bug here; mypy isn't communicating what it wants from you very clearly at all.

> I assume mypy will not complain about stats = collections.Counter[str]() then.

Correct, that will make mypy happy (#110398 (comment))
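
For reference, a short sketch of the form the thread settles on. Subscripting `collections.Counter` at the call site works at runtime on Python 3.9 and later and gives a type checker the key type without a separate annotation. The key names below are placeholders.

```python
import collections

# Parameterizing the class where it is constructed tells the type
# checker that the keys are strings.
stats = collections.Counter[str]()

# Placeholder keys, for illustration only.
stats["example.key"] += 3
stats.update({"example.key": 2, "other.key": 1})

print(stats["example.key"])
print(type(stats) is collections.Counter)
```

At runtime the result is an ordinary `Counter`; the `[str]` only matters to static analysis.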


@property
def defines(self) -> Defines:
Member

If you wrap the dict, then this can be a normal attribute.

Contributor Author

Yes, but since this value should be saved, it's convenient for it to be on the dictionary which is dumped to JSON.

]
stats["_stats_defines"] = get_stats_defines()
stats["_defines"] = get_defines()
class CountPer:
Member

Why not use a Ratio, or just a float?

Contributor Author

Because my secret plan is to also introduce CSV output, where we would want to format this differently.

Contributor Author

And also a Ratio is rendered as a percentage, a CountPer is rendered as an integer. "Uops run per trace" is much better represented as an integer rather than a percentage.

Member

I think uops per trace is more naturally a float than an int. I agree that 30[.0] is better than 3000% though.
Maybe add a percent: bool=True argument to Ratio's __init__?
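
One way the `percent` flag could look, as a sketch only. The field names `num` and `den`, the `markdown` method, and the one-decimal formatting are assumptions for illustration, not the PR's actual `Ratio` class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ratio:
    """Hypothetical Ratio that renders as a percentage or a plain float."""

    num: int
    den: int
    percent: bool = True

    def markdown(self) -> str:
        # An undefined ratio renders as an empty cell.
        if self.den == 0:
            return ""
        value = self.num / self.den
        if self.percent:
            return f"{value:.1%}"
        return f"{value:.1f}"


print(Ratio(1, 4).markdown())                      # 25.0%
print(Ratio(3000, 100, percent=False).markdown())  # 30.0
```

With this, "uops run per trace" would be `Ratio(uops, traces, percent=False)` and no separate `CountPer` type would be needed for markdown output; a different output format such as CSV would still need its own rendering.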

comparative: bool = True,
):
self.title = title
if not summary:
Member

If the summary is explicitly "", it is ignored. Keeping the default as None seems better.
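
The difference between the two checks, as a small sketch. The function names are made up, and falling back to the title is an assumption about what the `if not summary:` branch does.

```python
from typing import Optional


def summary_truthy(title: str, summary: Optional[str] = None) -> str:
    # `not summary` is true for both None and "", so an explicit
    # empty summary is silently replaced by the fallback.
    if not summary:
        return title
    return summary


def summary_is_none(title: str, summary: Optional[str] = None) -> str:
    # Only a missing argument triggers the fallback.
    if summary is None:
        return title
    return summary


print(repr(summary_truthy("Execution counts", "")))   # 'Execution counts'
print(repr(summary_is_none("Execution counts", "")))  # ''
```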

print(file=out)


class FixedTable(Table):
Member

I don't like all these subclasses of Table, it mixes up the responsibilities of creating the contents and the formatting.
Having a single Table class which does the formatting and takes the contents as a parameter to its __init__ would separate the responsibilities better.
So instead of ExecutionCountTable("uops"), it would be Table("uops", get_uop_counts())

Contributor Author

It's more complicated than that. Every table knows how to generate a single set of results, and then also supports combining two tables for comparative results. This usually is straightforward, but for some tables (e.g. execution count table) that behavior needs to be overridden. But I'll take a look at doing all of this in a more functional way -- I'm a little worried we'll end up back where we started, though.

Member

@markshannon markshannon Oct 5, 2023

The data can support merging, etc. get_uop_counts() can return an object that supports that functionality.
I just think that separating formatting from data manipulation will make things more maintainable.
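
A rough sketch of the split being proposed: a data object that knows how to produce rows for one dataset or for a comparison, and a single `Table` that only formats rows. `UopCounts`, `compare`, the `header` parameter, and the uop names are hypothetical; the thread only names `Table("uops", get_uop_counts())`.

```python
from dataclasses import dataclass

Rows = list[tuple[str, ...]]


@dataclass(frozen=True)
class UopCounts:
    """Hypothetical data object: holds counts, knows nothing about output."""

    counts: dict[str, int]

    def to_rows(self) -> Rows:
        # Single-dataset view, largest count first.
        ordered = sorted(self.counts.items(), key=lambda kv: -kv[1])
        return [(name, str(count)) for name, count in ordered]

    def compare(self, other: "UopCounts") -> Rows:
        # Comparative view: one row per name seen in either dataset.
        names = sorted(self.counts.keys() | other.counts.keys())
        return [
            (name, str(self.counts.get(name, 0)), str(other.counts.get(name, 0)))
            for name in names
        ]


@dataclass(frozen=True)
class Table:
    """Formatting only: takes its contents as a parameter."""

    title: str
    header: tuple[str, ...]
    rows: Rows

    def to_markdown(self) -> str:
        lines = [f"### {self.title}", ""]
        lines.append("| " + " | ".join(self.header) + " |")
        lines.append("|" + "---|" * len(self.header))
        lines.extend("| " + " | ".join(row) + " |" for row in self.rows)
        return "\n".join(lines)


base = UopCounts({"_LOAD_FAST": 5, "_STORE_FAST": 2})
head = UopCounts({"_LOAD_FAST": 7})
table = Table("uops", ("Name", "Count"), base.to_rows())
print(table.to_markdown())
```

Tables whose comparison logic is unusual would override `compare` on the data side, so `Table` itself never needs subclassing.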

mdboom added 3 commits October 5, 2023 11:21
This refactors summarize_stats so that the comparative tables are easier
to make and use more common code.
@mdboom mdboom force-pushed the summarize_stats-refactor branch from 4fde5b4 to 901b952 Compare October 5, 2023 15:22
@mdboom mdboom requested a review from markshannon October 5, 2023 20:28
@mdboom
Contributor Author

mdboom commented Oct 5, 2023

I have modified this to:

  1. Take a more functional approach without inheritance
  2. Separate out markdown output from data handling
  3. Not have Stats inherit from dict

@markshannon
Member

I think what bothers me with this PR is that the data processing is mixed in with the data storage. This can be a problem with OOP.

There is nothing wrong with objects that enhance and structure the data, but data processing should be separated, otherwise the code can be hard to follow.

I think we need the following:

  1. To read the data off disk and into one big poorly structured object ("blob") (we already have this https://github.com/python/cpython/blob/main/Tools/scripts/summarize_stats.py#L229-L244)
  2. A structured data type, Stats.
  3. A function to convert the "blob" to Stats
  4. Functions/methods to save and load the Stats to a file (currently json)
  5. A function/method to make a single Stats from the diff of two Stats
  6. A function to output Stats to a human readable format (currently markdown)

Each "function" above doesn't have to be one function, but should be independent of the others.
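
Items 4 and 5 of the list above could be sketched like this. The single `execution_counts` field, the function names, and the "head minus base" meaning of a diff are all assumptions for illustration; the real `Stats` would carry more categories.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass(frozen=True)
class Stats:
    """Hypothetical structured stats with one top-level category."""

    execution_counts: dict[str, int]


def save_stats(stats: Stats, path: Path) -> None:
    path.write_text(json.dumps(asdict(stats), indent=2))


def load_stats(path: Path) -> Stats:
    return Stats(**json.loads(path.read_text()))


def diff_stats(base: Stats, head: Stats) -> Stats:
    # Builds a new Stats; neither input is modified.
    names = base.execution_counts.keys() | head.execution_counts.keys()
    return Stats({
        name: head.execution_counts.get(name, 0) - base.execution_counts.get(name, 0)
        for name in sorted(names)
    })


base = Stats({"LOAD_FAST": 5, "STORE_FAST": 2})
head = Stats({"LOAD_FAST": 7})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "stats.json"
    save_stats(base, path)
    reloaded = load_stats(path)

print(reloaded == base)
print(diff_stats(base, head).execution_counts)
```

Each function depends only on the `Stats` type, so the JSON format or the diff rule can change without touching the others.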

By all means use objects (this is Python, not Haskell), but I'd recommend trying to stick to functional(ish) principles:

  • All classes should have a simple __init__ (as would be generated for a dataclass)
  • No method should mutate the object
  • No special methods
  • Methods should stick to the domain of the object. So for a Section object, to_rows()->list[tuple[str]] is fine, but write_markdown() maybe not so much.

With that, the base pipeline (raw stats files to markdown) would look something like:

def main():
    # Process args and find folders, etc.
    raw_stats = gather_stats(stats_dir)
    stats = structure_stats(raw_stats)
    output_stats(stats, outfile)

The Stats class should be a structured data type, with the top level containing attributes for each of the top level categories in the data, execution_counts, pair_counts, predecessor_pairs, etc.
It could simply be a list of Sections, where the Sections describe the data, if that makes more sense. The markdown file is structured as a list of sections.

A Section will need to describe how to present the data as well as contain it.
For example, execution counts is a list of 3-tuples: (name, count, miss). But it also needs data on how to present it:
it should be sorted by count, have ratio and cumulative ratio columns, and present the miss as a percentage.
That could be a method which converts the data to a table (where table is list[tuple[str]]).
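
A sketch of what such a method might look like for execution counts. The class name, the column order, and the reading of "miss" as a fraction of that entry's count are assumptions. The alias uses `tuple[str, ...]` because `tuple[str]` strictly means a one-element tuple to a type checker.

```python
from dataclasses import dataclass

Rows = list[tuple[str, ...]]


@dataclass(frozen=True)
class ExecutionCounts:
    """Hypothetical section data: a list of (name, count, miss) tuples."""

    entries: list[tuple[str, int, int]]

    def to_rows(self) -> Rows:
        total = sum(count for _, count, _ in self.entries)
        rows: Rows = []
        cumulative = 0
        # Sort by count, largest first, without mutating self.entries.
        for name, count, miss in sorted(
            self.entries, key=lambda e: e[1], reverse=True
        ):
            cumulative += count
            rows.append((
                name,
                str(count),
                f"{count / total:.1%}" if total else "",       # ratio
                f"{cumulative / total:.1%}" if total else "",  # cumulative ratio
                f"{miss / count:.1%}" if count else "",        # miss rate
            ))
        return rows


section = ExecutionCounts(
    [("CALL", 10, 5), ("LOAD_FAST", 60, 0), ("BINARY_OP", 30, 3)]
)
for row in section.to_rows():
    print(row)
```

The method returns plain strings, so a separate output function can turn the rows into markdown or any other format.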

This PR contains the following comment:

> A Table defines how to convert a set of Stats into a specific set of rows displaying some aspect of the data.

That should be a method on the Stats (or Section):

Rows: TypeAlias = list[tuple[str]]
def to_rows(self) -> Rows:
     ...

@mdboom
Contributor Author

mdboom commented Oct 6, 2023

> I think what bothers me with this PR is that the data processing is mixed in with the data storage.

Indeed, it isn't. There are 4 separate layers:

  • raw data (Stats)
  • abstract views of that data (Table)
  • organization of that data (Section)
  • output (output_markdown)

> There is nothing wrong with objects that enhance and structure the data, but data processing should be separated, otherwise the code can be hard to follow.

I agree 100%, but I think this refactor does that.

> The Stats class should be a structured data type, with the top level containing attributes for each of the top level categories in the data, execution_counts, pair_counts, predecessor_pairs, etc.
>
> It could simply be a list of Sections, where the Sections describe the data, if that makes more sense. The markdown file is structured as a list of sections.

Wouldn't that be more of a combining of processing and presentation?

I think what would address your concerns is:

  • The calc_*_ functions become methods on the Stats class.
  • The Table class would go away.
  • The Section class would describe the organization of the file and how tables need to be combined.

@mdboom
Contributor Author

mdboom commented Oct 6, 2023

@markshannon: I've largely moved the data processing inside of the Stats class and the new OpcodeStats class. The Table/Section distinction is still required, since a Section may have multiple tables etc. But I hope this is closer to what you had in mind in terms of separation of concerns.

@markshannon
Member

Have you checked that the latest commit produces the same output as main?

@mdboom
Contributor Author

mdboom commented Oct 23, 2023

> Have you checked that the latest commit produces the same output as main?

Yes, for single datasets. For comparative runs the results differ, but the differences are due to bug fixes.

@markshannon markshannon merged commit 81eba76 into python:main Oct 24, 2023
aisk pushed a commit to aisk/cpython that referenced this pull request Feb 11, 2024
Glyphack pushed a commit to Glyphack/cpython that referenced this pull request Sep 2, 2024