Performance tests: configure as a production environment #52016
Conversation
Related: I discovered this by running into this issue in core at WordPress/wordpress-develop#4565 (comment)
Size Change: 0 B. Total Size: 1.44 MB.
This is the difference I see locally when I enable/disable debug. This is the difference in the performance CI job in this branch vs one of the last to succeed in `trunk`.
Looks good to me! I'll make sure to include that before running metrics on the migrated tests 🙌
…#52229)
* Patterns: Fix setting of sync status for fully synced patterns (#51952)
* Add check for sync status of fully to account for bug in 16.1 release
* Fix phpunit failures (#51950)
* Fix phpunit failures
* Performance tests: configure as a production environment (#52016)

Co-authored-by: Aaron Robertshaw <60436221+aaronrobertshaw@users.noreply.github.com>
Co-authored-by: Robert Anderson <robert@noisysocks.com>
Co-authored-by: André <583546+oandregal@users.noreply.github.com>
I think it's expected because we normalize results in codevitals. Since the change was made to the test runner script (`performance.js`), both the base branch's and the subject branch's results were affected by the change from this PR, which led to the normalization of the stored/displayed value. For example:

Current TTFB in codevitals: 50

This checks out with the results from the merge commit perf comparison job. Does that make sense? cc @youknowriad
Yes. Unless two versions of WP have different performance results in production and similar results in development, this PR shouldn't have any impact on the graphs.
@WunderBart @youknowriad So, a side effect of the normalization done by codevitals is that we are cancelling out any change in the WordPress runtime used as base, aren't we? For example, any improvement in TTFB, LCP, etc. that didn't land in Gutenberg first is not reflected in the metrics: because it affects both the latest commit (`metricValue`) and the base reference (`baseMetrics`), the stored values end up close to the old reference. Even the impact of the changes that landed in Gutenberg first is mitigated, because once they come as part of the WordPress runtime they go through this process as well. This is problematic for front-end metrics: it builds "inertia" in their values and deviates from reality too much. It makes it impossible to compare how the same metric fares for block & classic themes, for example. How can we fix it? By manually updating the values stored in the codevitals database for the base reference `34af582`?
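For illustration, the normalization being discussed could work roughly like this (a hypothetical sketch, not codevitals' actual code; the function and parameter names are made up):

```js
// Hypothetical sketch of the normalization described above. Each CI run
// measures both the subject commit and a fixed base reference commit,
// then scales the result so the base always maps onto the same stored
// historical value.
function normalize( measuredValue, baseMeasuredValue, storedBaseValue ) {
	return ( measuredValue / baseMeasuredValue ) * storedBaseValue;
}

// Raw TTFB doubled for both branches after this PR changed the runner:
normalize( 100, 100, 50 ); // => 50: the displayed value is unchanged.

// A TTFB improvement that ships in the WordPress base lowers both
// measurements equally, so the stored value stays pinned near the old
// reference and the improvement never shows up on the graph:
normalize( 80, 80, 50 ); // => 50 again: the "inertia" described above.
```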
@oandregal I don't think we should fix that at all: codevitals for "Gutenberg" is about tracking Gutenberg's impact, and the WordPress project is about tracking WordPress's impact. In other words, we should track these improvements in WordPress instead, since the code change happened in WordPress.
@youknowriad I see what you mean. My concern is that, in addition to being unable to compare block vs classic themes, there's also a perception issue. Block & classic themes have about the same TTFB (according to the WordPress dashboard and other tests), but the Gutenberg dashboard reports block themes as being 2.5x slower than classic ones. This may confuse people or lead them to the wrong conclusions. If we're not fixing the issue, what else can we do to clarify this?
Yeah, good point. Codevitals is currently optimized to check for commit impact on metrics, not for cross-metric comparison; I'm not sure how to fix that, or whether we should. Maybe we can start storing both normalized and actual values and have a page/UI where you can "focus" on a single commit and compare across metrics (showing the real values instead of the normalized ones). The history graphs should always show the normalized values, though.
A potential solution could be a relative chart with a 0% base. Since the actual milliseconds are not our primary concern, we could opt not to display them. Instead, we can have the chart start at a 0% value (the base), and then fluctuate as new data (commits) comes in. If we need to display the actual (not normalized) milliseconds, we could do so in a separate section or only for the most recent tick, since the last measurement is the only one of significant value. Does this approach make sense?
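A minimal sketch of the kind of transformation this proposes (the function name is assumed; the first sample is taken as the 0% base):

```js
// Convert a series of raw measurements into deviations (in %) from the
// first recorded value, which becomes the 0% base of the chart.
function toRelativeSeries( values ) {
	const base = values[ 0 ];
	return values.map( ( value ) => ( ( value - base ) / base ) * 100 );
}

// e.g. toRelativeSeries( [ 50, 55, 70 ] ) => [ 0, 10, 40 ]
// i.e. the latest commit sits 40% above the base.
```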
I'm not sure if I follow: what would be the zero, and what would be the actual values of that chart ("normalized" or the actual values)?
The data represented in this chart essentially reflects relative percentages, which indicate the degree of deviation from a set baseline of 0%. Both normalized and "actual" values are meaningless in isolation, and we don't really care about either of them; we only care about consistency over time, so that we're able to detect whether performance is getting worse or better. I think that dropping the milliseconds would actually be liberating in this context. They only confuse, and there's no way to produce numbers that would mean anything unless we actually start RUMming the metrics. Even then, though, I don't think they would be a good source for what we aim to achieve here, which is to track performance changes over time.
Thanks, I see now. I don't think percentages or ms values make any difference: the gist of my concern is that normalizing the numbers we receive builds inertia, creates deviation from reality, and doesn't allow us to compare metrics (block vs classic).
I believe the relative percentage would eliminate the confusion of, for example, the block theme being displayed as 2.5x slower than the classic one, which you mentioned above. As for the data normalization, to my knowledge, we have to do it to (at least somewhat) accurately track the performance changes, as the testing environment is inherently unstable and would cause significant value drift if we didn't. Regarding the cross-metric comparison, we do have the data for that (e.g. classic vs. block TTFB); we're just not utilizing it that way. I imagine, though, that it would be fairly straightforward to add that metric to the codevitals dashboard. Do you think that would suffice as a solution here?
Unfortunately, it wouldn't: you'd still see a reference (the 0% base) and inspecting both graphs you'd still see the percentage for block themes is a lot higher than for classic.
I don't mean to remove it; it's useful. It comes with side effects we need to manage, though. To mitigate them, I don't see any option other than resetting the values regularly: perhaps every time we bump the WordPress base, once a year, etc.
I don't mean to add a new metric/card/etc. My point is that the numbers we already display cannot be compared. Yet anyone who visits codevitals compares them anyway, and that's an issue because they go away with the wrong impression (a 2.5x difference between block and classic themes). That's my concern.
We can check that! 😄 I downloaded the TTFB data for both classic and block themes and transformed the values to relative percentages, setting the base value as 0%. Below are the visuals. The snap on the left is from codevitals, and on the right is from my sheet.

TTFB, Classic Theme
TTFB, Block Theme
As you can see, the relative value is actually showing the same increase (~40%) since we started measuring, which is a correct indication that also doesn't suggest the 2.5x difference. Interestingly, the first recorded sample is from a commit from February, which means that, roughly between the v15 and v16 (major) releases, the TTFB for both themes went up 40%. I think it's a good indication that we should be resetting the base at least every major GB release (every ~4-6 months). IMO, it would provide a clear 'snapshot' view of the performance improvements or regressions with each new release, which can help understand the changes in a concise and relevant way. It would also ensure that the relative measurements are not excessively influenced by old data or anomalies from past versions. Does that make sense? /cc @dmsnell
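To make the arithmetic concrete (hypothetical numbers, not the real measurements): two series can drift by the same relative amount while staying 2.5x apart in absolute terms, which is why the relative view avoids the misleading comparison:

```js
// Hypothetical TTFB samples in ms (illustrative only):
const classicTtfb = [ 20, 28 ]; // +40% since the base
const blockTtfb = [ 50, 70 ]; // also +40%, yet 2.5x classic throughout
```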
What?

Sets up the `WP_DEBUG` and `SCRIPT_DEBUG` constants to `false`, so the WordPress instance resembles a production environment.

Why?
Certain aspects of performance differ when running with debug enabled: some caches may be disabled, assets are not minified, etc. We should aim to measure performance as close as possible to a production environment.
This is the difference in the performance CI job in this branch vs one of the last to succeed in `trunk` (can also be compared with current codevitals metrics):

How?
Sets the `WP_DEBUG` and `SCRIPT_DEBUG` constants to `false`.
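For reference, a minimal sketch of what this could look like in a `.wp-env.json` file (assuming the change is applied through wp-env, whose `config` key defines PHP constants; the actual file and keys touched by this PR may differ):

```json
{
	"config": {
		"WP_DEBUG": false,
		"SCRIPT_DEBUG": false
	}
}
```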
Testing Instructions
See the performance results.