-
-
Notifications
You must be signed in to change notification settings - Fork 31.6k
gh-110019: Refactor summarize_stats #110398
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this. This file definitely needed refactoring.
I have few comments. I think a more functional (as opposed to OO) style would help clarity, but the general design seems sound.
Tools/scripts/summarize_stats.py
Outdated
a_ncols = list(set(len(x) for x in a_rows)) | ||
if len(a_ncols) != 1: | ||
raise ValueError("Table a is ragged") | ||
class Stats(dict): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Inheriting from builtin collections can be awkward.
Could you wrap the dict?
Tools/scripts/summarize_stats.py
Outdated
else: | ||
ncols = b_ncols[0] | ||
elif input.is_dir(): | ||
stats: collections.Counter = collections.Counter() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This type annotation seems redundant
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, but mypy requires it :(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't use mypy then?
Is there a Mypy issue for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should make mypy happy and Mark less unhappy:
stats: collections.Counter = collections.Counter() | |
stats = collections.Counter[str]() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't use mypy then?
Is there a Mypy issue for this?
It's not a mypy bug. Pyright will complain at you in exactly the same way about this. Mypy can't tell what kind of items are going to be stored as keys for the Counter
, so it demands an explicit annotation. @mdboom's annotation shuts mypy up, because collections.Counter
as a type annotation is equivalent to collections.Counter[Any]
. But the better solution is to use collections.Counter[str]
, because they keys of stats
are all strings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a mypy bug.
Omitting the annotation and providing collections.Counter
as an annotation has exactly the same information content. So complaining about one and not the other is erroneous.
I assume mypy will not complain about stats = collections.Counter[str]()
then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree there's a usability bug here; mypy isn't communicating what it wants from you very clearly at all.
I assume mypy will not complain about
stats = collections.Counter[str]()
then.
Correct, that will make mypy happy (#110398 (comment))
Tools/scripts/summarize_stats.py
Outdated
|
||
@property | ||
def defines(self) -> Defines: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you wrap the dict, then this can be a normal attribute.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but since this value should be saved, it's convenient for it to be on the dictionary which is dumped to JSON.
Tools/scripts/summarize_stats.py
Outdated
] | ||
stats["_stats_defines"] = get_stats_defines() | ||
stats["_defines"] = get_defines() | ||
class CountPer: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not use a Ratio
, or just a float
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because my secret plan is to also introduce CSV output, where we would want to format this differently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And also a Ratio is rendered as a percentage, a CountPer is rendered as an integer. "Uops run per trace" is much better represented as an integer rather than a percentage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think uops per trace is more naturally a float than an int. I agree that 30[.0] is better than 3000% though.
Maybe add a percent: bool=True
argument to Ratio
's __init__
?
Tools/scripts/summarize_stats.py
Outdated
comparative: bool = True, | ||
): | ||
self.title = title | ||
if not summary: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the summary is explicitly "", it is ignored. Keeping the default as None
seems better.
Tools/scripts/summarize_stats.py
Outdated
print(file=out) | ||
|
||
|
||
class FixedTable(Table): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like all these subclasses of Table
, it mixes up the responsibilities of creating the contents and the formatting.
Having a single Table
class which does the formatting and takes the contents as a parameter to its __init__
would separate the responsibilities better.
So instead of ExecutionCountTable("uops")
, it would be Table("uops", get_uop_counts())
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's more complicated than that. Every table knows how to generate a single set of results, and then also supports combining two tables for comparative results. This usually is straightforward, but for some tables (e.g. execution count table) that behavior needs to be overridden. But I'll take a look at doing all of this in a more functional way -- I'm a little worried we'll end up back where we started, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The data can support merging, etc. get_uop_counts()
can return an object that supports that functionality.
I just think that separating formating from data manipulation will make things more maintainable.
This refactors summarize_stats so that the comparative tables are easier to make and use more common code.
4fde5b4
to
901b952
Compare
I have modified this to:
|
I think what bothers me with this PR is that the data processing is mixed in with the data storage. This can be a problem with OOP. There is nothing wrong with objects that enhance and structure the data, but data processing should be separated, otherwise the code can be hard to follow. I think we need the following:
Each "function" above doesn't have to one function, but should be independent of the others. By all means use objects (this is Python, not Haskell), but I'd recommend trying to stick to functional(ish) principles:
With that, the base pipeline (raw stats files to markdown) would look something like: def main():
# Process args and find folders, etc.
raw_stats = gather_stats(stats_dir)
stats = structure_stats(raw_stats)
output_stats(stats, outfile) The A This PR contains the following comment:
That should be a method on the Stats (or Section): Rows: TypeAlias = list[tuple[str]]
def to_rows(self) -> Rows:
... |
Indeed, it isn't. There are 4 separate layers:
I agree 100%, but I think this refactor does that.
Wouldn't that be more of a combining of processing and presentation? I think what would address your concerns is:
|
@markshannon: I've largely moved the data processing inside of the |
Have you checked that the latest commit produces the same output as main? |
Yes, for single datasets. For comparative the results are different, but due to bugfixes. |
This refactors summarize_stats so that the comparative tables are easier to make and use more common code.
Reviewing this as a diff may be rather difficult -- instead maybe just look at the file verbatim.
Tools/scripts/summarize_stats.py
#110019