Post a summary of the flaky tests to the commit #45798

Merged: 3 commits merged into trunk from add/flaky-tests-comment on Dec 16, 2022

Conversation

@kevin940726 (Member) commented Nov 16, 2022

What?

This is an attempt to bring more visibility to flaky tests during reviews. It will post a comment on the commit which has flaky tests.

Why?

The original flaky tests reporter started with the idea of retrying failed tests to unblock contributors in PRs with flaky tests. However, a side-effect is that flaky tests tend to be overlooked and ignored by the author. This PR tries to fix that by posting the summary of flaky tests to the commit without blocking the PR authors.

How?

Tons of improvements are in this PR:

  • End-to-end testing workflows are merged into one. This seems to be the only way to ensure that the report-flaky-tests job runs exactly once, after all e2e jobs have finished (see the sketch after this list).
  • We'll bail out of the report-flaky-tests job early if there are no flaky test reports.
  • We no longer split artifacts into differently named artifacts per matrix run. GitHub appears to allow uploading different files to the same artifact.
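
To make the shape of this more concrete, here's a rough sketch of what the merged workflow looks like conceptually (job names, artifact names, and paths here are illustrative, not the actual end2end-test.yml):

name: End-to-End Tests

on: [pull_request, push]

jobs:
  e2e-playwright:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        part: [1, 2, 3]
    steps:
      # ... run the tests, writing any flaky-test reports to flaky-tests/ ...
      - name: Upload flaky test reports
        if: always()
        uses: actions/upload-artifact@v3
        with:
          # Every matrix run uploads its files into the same artifact name.
          name: flaky-tests-report
          path: flaky-tests/*.json
          if-no-files-found: ignore

  report-flaky-tests:
    # Runs exactly once, after all e2e jobs have finished (pass or fail).
    needs: [e2e-playwright]
    if: always()
    runs-on: ubuntu-latest
    steps:
      - name: Download flaky test reports
        uses: actions/download-artifact@v3
        with:
          name: flaky-tests-report
          path: flaky-tests
      # A later step bails out early when no reports were uploaded,
      # otherwise it posts the summary comment to the commit.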

Testing Instructions

I'm not aware of an easy way to test this; it's unfortunately a common problem with custom GitHub Actions. The best way I can think of is to fork the repo and test it there.

See below for detailed instructions on how to test it in your own fork: #45798 (comment).

Screenshots or screencast

Here's what it might look like in the commit:

image

Or you can view it here: kevin940726@b87c01c#commitcomment-90061333

@kevin940726 kevin940726 added the [Type] Automated Testing Testing infrastructure changes impacting the execution of end-to-end (E2E) and/or unit tests. label Nov 16, 2022
@codesandbox bot commented Nov 16, 2022

Open in CodeSandbox Web Editor | VS Code | VS Code Insiders

@github-actions bot commented Nov 16, 2022

Size Change: 0 B

Total Size: 1.32 MB

Filename Size
build/a11y/index.min.js 993 B
build/annotations/index.min.js 2.78 kB
build/api-fetch/index.min.js 2.27 kB
build/autop/index.min.js 2.15 kB
build/blob/index.min.js 487 B
build/block-directory/index.min.js 7.16 kB
build/block-directory/style-rtl.css 1.04 kB
build/block-directory/style.css 1.04 kB
build/block-editor/content-rtl.css 2.71 kB
build/block-editor/content.css 2.71 kB
build/block-editor/default-editor-styles-rtl.css 403 B
build/block-editor/default-editor-styles.css 403 B
build/block-editor/index.min.js 182 kB
build/block-editor/style-rtl.css 14.7 kB
build/block-editor/style.css 14.6 kB
build/block-library/blocks/archives/editor-rtl.css 61 B
build/block-library/blocks/archives/editor.css 60 B
build/block-library/blocks/archives/style-rtl.css 90 B
build/block-library/blocks/archives/style.css 90 B
build/block-library/blocks/audio/editor-rtl.css 150 B
build/block-library/blocks/audio/editor.css 150 B
build/block-library/blocks/audio/style-rtl.css 122 B
build/block-library/blocks/audio/style.css 122 B
build/block-library/blocks/audio/theme-rtl.css 138 B
build/block-library/blocks/audio/theme.css 138 B
build/block-library/blocks/avatar/editor-rtl.css 116 B
build/block-library/blocks/avatar/editor.css 116 B
build/block-library/blocks/avatar/style-rtl.css 84 B
build/block-library/blocks/avatar/style.css 84 B
build/block-library/blocks/block/editor-rtl.css 305 B
build/block-library/blocks/block/editor.css 305 B
build/block-library/blocks/button/editor-rtl.css 485 B
build/block-library/blocks/button/editor.css 485 B
build/block-library/blocks/button/style-rtl.css 532 B
build/block-library/blocks/button/style.css 532 B
build/block-library/blocks/buttons/editor-rtl.css 337 B
build/block-library/blocks/buttons/editor.css 337 B
build/block-library/blocks/buttons/style-rtl.css 332 B
build/block-library/blocks/buttons/style.css 332 B
build/block-library/blocks/calendar/style-rtl.css 239 B
build/block-library/blocks/calendar/style.css 239 B
build/block-library/blocks/categories/editor-rtl.css 84 B
build/block-library/blocks/categories/editor.css 83 B
build/block-library/blocks/categories/style-rtl.css 100 B
build/block-library/blocks/categories/style.css 100 B
build/block-library/blocks/code/editor-rtl.css 53 B
build/block-library/blocks/code/editor.css 53 B
build/block-library/blocks/code/style-rtl.css 121 B
build/block-library/blocks/code/style.css 121 B
build/block-library/blocks/code/theme-rtl.css 124 B
build/block-library/blocks/code/theme.css 124 B
build/block-library/blocks/columns/editor-rtl.css 108 B
build/block-library/blocks/columns/editor.css 108 B
build/block-library/blocks/columns/style-rtl.css 406 B
build/block-library/blocks/columns/style.css 406 B
build/block-library/blocks/comment-author-avatar/editor-rtl.css 125 B
build/block-library/blocks/comment-author-avatar/editor.css 125 B
build/block-library/blocks/comment-content/style-rtl.css 92 B
build/block-library/blocks/comment-content/style.css 92 B
build/block-library/blocks/comment-template/style-rtl.css 199 B
build/block-library/blocks/comment-template/style.css 198 B
build/block-library/blocks/comments-pagination-numbers/editor-rtl.css 123 B
build/block-library/blocks/comments-pagination-numbers/editor.css 121 B
build/block-library/blocks/comments-pagination/editor-rtl.css 222 B
build/block-library/blocks/comments-pagination/editor.css 209 B
build/block-library/blocks/comments-pagination/style-rtl.css 235 B
build/block-library/blocks/comments-pagination/style.css 231 B
build/block-library/blocks/comments-title/editor-rtl.css 75 B
build/block-library/blocks/comments-title/editor.css 75 B
build/block-library/blocks/comments/editor-rtl.css 840 B
build/block-library/blocks/comments/editor.css 839 B
build/block-library/blocks/comments/style-rtl.css 637 B
build/block-library/blocks/comments/style.css 636 B
build/block-library/blocks/cover/editor-rtl.css 612 B
build/block-library/blocks/cover/editor.css 613 B
build/block-library/blocks/cover/style-rtl.css 1.57 kB
build/block-library/blocks/cover/style.css 1.56 kB
build/block-library/blocks/embed/editor-rtl.css 293 B
build/block-library/blocks/embed/editor.css 293 B
build/block-library/blocks/embed/style-rtl.css 410 B
build/block-library/blocks/embed/style.css 410 B
build/block-library/blocks/embed/theme-rtl.css 138 B
build/block-library/blocks/embed/theme.css 138 B
build/block-library/blocks/file/editor-rtl.css 300 B
build/block-library/blocks/file/editor.css 300 B
build/block-library/blocks/file/style-rtl.css 253 B
build/block-library/blocks/file/style.css 254 B
build/block-library/blocks/file/view.min.js 353 B
build/block-library/blocks/freeform/editor-rtl.css 2.44 kB
build/block-library/blocks/freeform/editor.css 2.44 kB
build/block-library/blocks/gallery/editor-rtl.css 984 B
build/block-library/blocks/gallery/editor.css 988 B
build/block-library/blocks/gallery/style-rtl.css 1.55 kB
build/block-library/blocks/gallery/style.css 1.55 kB
build/block-library/blocks/gallery/theme-rtl.css 122 B
build/block-library/blocks/gallery/theme.css 122 B
build/block-library/blocks/group/editor-rtl.css 654 B
build/block-library/blocks/group/editor.css 654 B
build/block-library/blocks/group/style-rtl.css 57 B
build/block-library/blocks/group/style.css 57 B
build/block-library/blocks/group/theme-rtl.css 78 B
build/block-library/blocks/group/theme.css 78 B
build/block-library/blocks/heading/style-rtl.css 76 B
build/block-library/blocks/heading/style.css 76 B
build/block-library/blocks/html/editor-rtl.css 332 B
build/block-library/blocks/html/editor.css 333 B
build/block-library/blocks/image/editor-rtl.css 829 B
build/block-library/blocks/image/editor.css 828 B
build/block-library/blocks/image/style-rtl.css 627 B
build/block-library/blocks/image/style.css 630 B
build/block-library/blocks/image/theme-rtl.css 137 B
build/block-library/blocks/image/theme.css 137 B
build/block-library/blocks/latest-comments/style-rtl.css 298 B
build/block-library/blocks/latest-comments/style.css 298 B
build/block-library/blocks/latest-posts/editor-rtl.css 213 B
build/block-library/blocks/latest-posts/editor.css 212 B
build/block-library/blocks/latest-posts/style-rtl.css 478 B
build/block-library/blocks/latest-posts/style.css 478 B
build/block-library/blocks/list/style-rtl.css 88 B
build/block-library/blocks/list/style.css 88 B
build/block-library/blocks/media-text/editor-rtl.css 266 B
build/block-library/blocks/media-text/editor.css 263 B
build/block-library/blocks/media-text/style-rtl.css 507 B
build/block-library/blocks/media-text/style.css 505 B
build/block-library/blocks/more/editor-rtl.css 431 B
build/block-library/blocks/more/editor.css 431 B
build/block-library/blocks/navigation-link/editor-rtl.css 716 B
build/block-library/blocks/navigation-link/editor.css 715 B
build/block-library/blocks/navigation-link/style-rtl.css 115 B
build/block-library/blocks/navigation-link/style.css 115 B
build/block-library/blocks/navigation-submenu/editor-rtl.css 299 B
build/block-library/blocks/navigation-submenu/editor.css 299 B
build/block-library/blocks/navigation/editor-rtl.css 2.15 kB
build/block-library/blocks/navigation/editor.css 2.16 kB
build/block-library/blocks/navigation/style-rtl.css 2.23 kB
build/block-library/blocks/navigation/style.css 2.21 kB
build/block-library/blocks/navigation/view-modal.min.js 2.81 kB
build/block-library/blocks/navigation/view.min.js 447 B
build/block-library/blocks/nextpage/editor-rtl.css 395 B
build/block-library/blocks/nextpage/editor.css 395 B
build/block-library/blocks/page-list/editor-rtl.css 363 B
build/block-library/blocks/page-list/editor.css 363 B
build/block-library/blocks/page-list/style-rtl.css 175 B
build/block-library/blocks/page-list/style.css 175 B
build/block-library/blocks/paragraph/editor-rtl.css 174 B
build/block-library/blocks/paragraph/editor.css 174 B
build/block-library/blocks/paragraph/style-rtl.css 279 B
build/block-library/blocks/paragraph/style.css 281 B
build/block-library/blocks/post-author/style-rtl.css 175 B
build/block-library/blocks/post-author/style.css 176 B
build/block-library/blocks/post-comments-form/editor-rtl.css 96 B
build/block-library/blocks/post-comments-form/editor.css 96 B
build/block-library/blocks/post-comments-form/style-rtl.css 501 B
build/block-library/blocks/post-comments-form/style.css 501 B
build/block-library/blocks/post-date/style-rtl.css 61 B
build/block-library/blocks/post-date/style.css 61 B
build/block-library/blocks/post-excerpt/editor-rtl.css 73 B
build/block-library/blocks/post-excerpt/editor.css 73 B
build/block-library/blocks/post-excerpt/style-rtl.css 69 B
build/block-library/blocks/post-excerpt/style.css 69 B
build/block-library/blocks/post-featured-image/editor-rtl.css 586 B
build/block-library/blocks/post-featured-image/editor.css 584 B
build/block-library/blocks/post-featured-image/style-rtl.css 318 B
build/block-library/blocks/post-featured-image/style.css 318 B
build/block-library/blocks/post-navigation-link/style-rtl.css 153 B
build/block-library/blocks/post-navigation-link/style.css 153 B
build/block-library/blocks/post-template/editor-rtl.css 99 B
build/block-library/blocks/post-template/editor.css 98 B
build/block-library/blocks/post-template/style-rtl.css 282 B
build/block-library/blocks/post-template/style.css 282 B
build/block-library/blocks/post-terms/style-rtl.css 96 B
build/block-library/blocks/post-terms/style.css 96 B
build/block-library/blocks/post-title/style-rtl.css 100 B
build/block-library/blocks/post-title/style.css 100 B
build/block-library/blocks/preformatted/style-rtl.css 103 B
build/block-library/blocks/preformatted/style.css 103 B
build/block-library/blocks/pullquote/editor-rtl.css 135 B
build/block-library/blocks/pullquote/editor.css 135 B
build/block-library/blocks/pullquote/style-rtl.css 326 B
build/block-library/blocks/pullquote/style.css 325 B
build/block-library/blocks/pullquote/theme-rtl.css 167 B
build/block-library/blocks/pullquote/theme.css 167 B
build/block-library/blocks/query-pagination-numbers/editor-rtl.css 122 B
build/block-library/blocks/query-pagination-numbers/editor.css 121 B
build/block-library/blocks/query-pagination/editor-rtl.css 221 B
build/block-library/blocks/query-pagination/editor.css 211 B
build/block-library/blocks/query-pagination/style-rtl.css 288 B
build/block-library/blocks/query-pagination/style.css 284 B
build/block-library/blocks/query-title/style-rtl.css 63 B
build/block-library/blocks/query-title/style.css 63 B
build/block-library/blocks/query/editor-rtl.css 440 B
build/block-library/blocks/query/editor.css 440 B
build/block-library/blocks/quote/style-rtl.css 213 B
build/block-library/blocks/quote/style.css 213 B
build/block-library/blocks/quote/theme-rtl.css 223 B
build/block-library/blocks/quote/theme.css 226 B
build/block-library/blocks/read-more/style-rtl.css 132 B
build/block-library/blocks/read-more/style.css 132 B
build/block-library/blocks/rss/editor-rtl.css 202 B
build/block-library/blocks/rss/editor.css 204 B
build/block-library/blocks/rss/style-rtl.css 289 B
build/block-library/blocks/rss/style.css 288 B
build/block-library/blocks/search/editor-rtl.css 165 B
build/block-library/blocks/search/editor.css 165 B
build/block-library/blocks/search/style-rtl.css 409 B
build/block-library/blocks/search/style.css 406 B
build/block-library/blocks/search/theme-rtl.css 114 B
build/block-library/blocks/search/theme.css 114 B
build/block-library/blocks/separator/editor-rtl.css 146 B
build/block-library/blocks/separator/editor.css 146 B
build/block-library/blocks/separator/style-rtl.css 234 B
build/block-library/blocks/separator/style.css 234 B
build/block-library/blocks/separator/theme-rtl.css 194 B
build/block-library/blocks/separator/theme.css 194 B
build/block-library/blocks/shortcode/editor-rtl.css 474 B
build/block-library/blocks/shortcode/editor.css 474 B
build/block-library/blocks/site-logo/editor-rtl.css 490 B
build/block-library/blocks/site-logo/editor.css 490 B
build/block-library/blocks/site-logo/style-rtl.css 203 B
build/block-library/blocks/site-logo/style.css 203 B
build/block-library/blocks/site-tagline/editor-rtl.css 86 B
build/block-library/blocks/site-tagline/editor.css 86 B
build/block-library/blocks/site-title/editor-rtl.css 116 B
build/block-library/blocks/site-title/editor.css 116 B
build/block-library/blocks/site-title/style-rtl.css 57 B
build/block-library/blocks/site-title/style.css 57 B
build/block-library/blocks/social-link/editor-rtl.css 184 B
build/block-library/blocks/social-link/editor.css 184 B
build/block-library/blocks/social-links/editor-rtl.css 674 B
build/block-library/blocks/social-links/editor.css 673 B
build/block-library/blocks/social-links/style-rtl.css 1.4 kB
build/block-library/blocks/social-links/style.css 1.39 kB
build/block-library/blocks/spacer/editor-rtl.css 332 B
build/block-library/blocks/spacer/editor.css 332 B
build/block-library/blocks/spacer/style-rtl.css 48 B
build/block-library/blocks/spacer/style.css 48 B
build/block-library/blocks/table/editor-rtl.css 457 B
build/block-library/blocks/table/editor.css 457 B
build/block-library/blocks/table/style-rtl.css 636 B
build/block-library/blocks/table/style.css 635 B
build/block-library/blocks/table/theme-rtl.css 184 B
build/block-library/blocks/table/theme.css 184 B
build/block-library/blocks/tag-cloud/style-rtl.css 251 B
build/block-library/blocks/tag-cloud/style.css 253 B
build/block-library/blocks/template-part/editor-rtl.css 404 B
build/block-library/blocks/template-part/editor.css 404 B
build/block-library/blocks/template-part/theme-rtl.css 101 B
build/block-library/blocks/template-part/theme.css 101 B
build/block-library/blocks/text-columns/editor-rtl.css 95 B
build/block-library/blocks/text-columns/editor.css 95 B
build/block-library/blocks/text-columns/style-rtl.css 166 B
build/block-library/blocks/text-columns/style.css 166 B
build/block-library/blocks/verse/style-rtl.css 87 B
build/block-library/blocks/verse/style.css 87 B
build/block-library/blocks/video/editor-rtl.css 691 B
build/block-library/blocks/video/editor.css 694 B
build/block-library/blocks/video/style-rtl.css 179 B
build/block-library/blocks/video/style.css 179 B
build/block-library/blocks/video/theme-rtl.css 139 B
build/block-library/blocks/video/theme.css 139 B
build/block-library/classic-rtl.css 162 B
build/block-library/classic.css 162 B
build/block-library/common-rtl.css 1.05 kB
build/block-library/common.css 1.05 kB
build/block-library/editor-elements-rtl.css 75 B
build/block-library/editor-elements.css 75 B
build/block-library/editor-rtl.css 11.7 kB
build/block-library/editor.css 11.7 kB
build/block-library/elements-rtl.css 54 B
build/block-library/elements.css 54 B
build/block-library/index.min.js 197 kB
build/block-library/reset-rtl.css 478 B
build/block-library/reset.css 478 B
build/block-library/style-rtl.css 12.4 kB
build/block-library/style.css 12.4 kB
build/block-library/theme-rtl.css 716 B
build/block-library/theme.css 721 B
build/block-serialization-default-parser/index.min.js 1.13 kB
build/block-serialization-spec-parser/index.min.js 2.83 kB
build/blocks/index.min.js 50.4 kB
build/components/index.min.js 204 kB
build/components/style-rtl.css 11.7 kB
build/components/style.css 11.7 kB
build/compose/index.min.js 12.3 kB
build/core-data/index.min.js 15.9 kB
build/customize-widgets/index.min.js 11.7 kB
build/customize-widgets/style-rtl.css 1.41 kB
build/customize-widgets/style.css 1.41 kB
build/data-controls/index.min.js 663 B
build/data/index.min.js 8.14 kB
build/date/index.min.js 32.1 kB
build/deprecated/index.min.js 518 B
build/dom-ready/index.min.js 336 B
build/dom/index.min.js 4.74 kB
build/edit-navigation/index.min.js 16.2 kB
build/edit-navigation/style-rtl.css 4.14 kB
build/edit-navigation/style.css 4.15 kB
build/edit-post/classic-rtl.css 571 B
build/edit-post/classic.css 571 B
build/edit-post/index.min.js 34.7 kB
build/edit-post/style-rtl.css 7.49 kB
build/edit-post/style.css 7.48 kB
build/edit-site/index.min.js 63.6 kB
build/edit-site/style-rtl.css 9.08 kB
build/edit-site/style.css 9.08 kB
build/edit-widgets/index.min.js 16.8 kB
build/edit-widgets/style-rtl.css 4.48 kB
build/edit-widgets/style.css 4.49 kB
build/editor/index.min.js 44.1 kB
build/editor/style-rtl.css 3.69 kB
build/editor/style.css 3.68 kB
build/element/index.min.js 4.72 kB
build/escape-html/index.min.js 548 B
build/experiments/index.min.js 882 B
build/format-library/index.min.js 7.2 kB
build/format-library/style-rtl.css 598 B
build/format-library/style.css 597 B
build/hooks/index.min.js 1.66 kB
build/html-entities/index.min.js 454 B
build/i18n/index.min.js 3.79 kB
build/is-shallow-equal/index.min.js 535 B
build/keyboard-shortcuts/index.min.js 1.79 kB
build/keycodes/index.min.js 1.86 kB
build/list-reusable-blocks/index.min.js 2.13 kB
build/list-reusable-blocks/style-rtl.css 865 B
build/list-reusable-blocks/style.css 865 B
build/media-utils/index.min.js 2.94 kB
build/notices/index.min.js 977 B
build/nux/index.min.js 2.07 kB
build/nux/style-rtl.css 775 B
build/nux/style.css 771 B
build/plugins/index.min.js 1.95 kB
build/preferences-persistence/index.min.js 2.23 kB
build/preferences/index.min.js 1.35 kB
build/primitives/index.min.js 960 B
build/priority-queue/index.min.js 1.59 kB
build/react-i18n/index.min.js 702 B
build/react-refresh-entry/index.min.js 8.44 kB
build/react-refresh-runtime/index.min.js 7.31 kB
build/redux-routine/index.min.js 2.75 kB
build/reusable-blocks/index.min.js 2.26 kB
build/reusable-blocks/style-rtl.css 283 B
build/reusable-blocks/style.css 283 B
build/rich-text/index.min.js 10.7 kB
build/server-side-render/index.min.js 2.09 kB
build/shortcode/index.min.js 1.52 kB
build/style-engine/index.min.js 1.53 kB
build/token-list/index.min.js 650 B
build/url/index.min.js 3.7 kB
build/vendors/inert-polyfill.min.js 2.48 kB
build/vendors/react-dom.min.js 41.8 kB
build/vendors/react.min.js 4.02 kB
build/viewport/index.min.js 1.09 kB
build/warning/index.min.js 280 B
build/widgets/index.min.js 7.27 kB
build/widgets/style-rtl.css 1.21 kB
build/widgets/style.css 1.21 kB
build/wordcount/index.min.js 1.06 kB

compressed-size-action

@kevin940726 kevin940726 self-assigned this Nov 16, 2022
@Mamaduka (Member) left a comment

I like the idea. Can you provide instructions for testing with the forked repo?

@kevin940726 (Member Author)

Sure. It's gonna be very tedious though, so bear with me if I miss any detail 😅.

  1. Fork the repo.
  2. Go to .github/workflows/end2end-test.yml.
  3. Change the two occurrences of WordPress/gutenberg to your own username (e.g. kevin940726/gutenberg) to allow running them in your fork (see the sketch after these steps).
  4. Create some intentionally flaky tests. An example written in Playwright would be:
const { test, expect } = require( '@wordpress/e2e-test-utils-playwright' );

test.describe( 'Flaky test', () => {
	test( 'should be flaky', async ( {}, testInfo ) => {
		// Fails on the first attempts and only passes once the test has been
		// retried more than once, so it gets reported as flaky.
		expect( testInfo.retry ).toBeGreaterThan( 1 );
	} );
} );
  5. Push the change to trunk, so that the workflow will be updated.
  6. Wait for the CI to finish; it should post a comment on the commit and open an issue for the flaky test.
  7. (Bonus) Make some random change in a new branch and open a PR. The workflow should post a comment on the PR's commit.
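
Regarding step 3, the two occurrences of WordPress/gutenberg are repository guards that keep these jobs from running on forks; hypothetically they look something like the snippet below (the exact lines in end2end-test.yml may differ), which is why they need to point at your fork:

  report-to-issues:
    name: Report to GitHub
    # Change 'WordPress/gutenberg' to '<your-username>/gutenberg' in your fork,
    # otherwise the reporting job is skipped.
    if: ${{ github.repository == 'WordPress/gutenberg' }}
    runs-on: ubuntu-latest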

@glendaviesnz (Contributor) commented Nov 24, 2022

This worked as advertised for me on a repo fork. I wonder if it would be better to add a comment to the PR rather than the commit when it is run against a PR, as the comment against the commit can be easily overlooked. E.g. below, the size change comment is much more obvious than the flaky test comment bubble against the commit:

Screen Shot 2022-11-24 at 3 27 15 PM

@kevin940726 (Member Author)

I wonder if it would be better to add a comment to the PR rather than the commit when it is run against a PR as the comment against the commit can be easily overlooked

True! I've thought about something similar. Since these comments are actually related to the commits rather than the PR, we would need to update the comment to include the commit hash or link to clarify that. I also wonder if we should still keep the comments for older commits; in that case we might want to keep posting to the commit. I guess posting to the commit is a good enough solution for now, and we can iterate if we find it helpful.

@glendaviesnz (Contributor)

I guess posting to the commit is a good enough solution for now, and we can iterate if we find it helpful.

Yeah, adding a comment to the PR could be a follow-up.

@kevin940726 kevin940726 marked this pull request as ready for review December 5, 2022 10:14
@kevin940726 (Member Author)

@youknowriad Curious about your thoughts 🙈.

@youknowriad (Contributor)

Sounds good to me. 👍

Makes me think we're starting to have more and more summaries: flaky tests, bundle size (potentially CodeSandbox later). A great UX would be a single "PR summary" comment with multiple collapsible details :)

@kevin940726 (Member Author)

A great UX would be a single "PR summary" comment with multiple collapsible details

Yeah, agree! FWIW, this PR doesn't post comments to PRs, yet. Would be nice if we could build a preview website for more advanced content too. For instance, I've been wanting to deploy the Playwright test report HTML somewhere for easier debuggability.

@glendaviesnz (Contributor)

As a way forward maybe we should:

  1. Merge this version
  2. Iterate on it to get it posting a PR comment
  3. Look at integrating the various reports into a single PR summary

@kevin940726 (Member Author)

Agreed! Would love to get some reviews! I'll rebase it to resolve the conflicts later.

@kevin940726 kevin940726 force-pushed the add/flaky-tests-comment branch from 88c5940 to 5779016 Compare December 15, 2022 05:29
@kevin940726 (Member Author) commented Dec 15, 2022

@youknowriad Seems like we have to remove/change the required checks if this is merged. 🙇

image

@talldan (Contributor) left a comment

Pretty happy with this code-wise and the idea seems good. I left a few questions.

I couldn't quite get it working on my fork; I'm not sure if I did something wrong - talldan@32d5f92

.github/workflows/end2end-test.yml (3 resolved review threads)
@kevin940726 (Member Author)

I couldn't quite get it working on my fork, I'm not sure if I did something wrong

Thanks for testing it out!

It appears that the problem is that the "Report to GitHub" step only runs in the context of trunk, so this PR has to be applied to the fork's trunk to test.

@kevin940726 kevin940726 force-pushed the add/flaky-tests-comment branch from 7311857 to f9cb3fd Compare December 15, 2022 08:57
@talldan (Contributor) left a comment

This works nicely. An example below (there were more flaky tests than I expected, looks like I didn't need to make a fake one 😄 ):
talldan@fd0e1a0#commitcomment-93272254

We'll have to figure out what to do about the required checks.

@kevin940726 (Member Author)

there were more flaky tests than I expected

Yeah, I noticed the same thing too 😆. That's exactly why we need this for more visible feedback 😅.

We'll have to figure out what to do about the required checks.

I think this is the doc for that, under step 7.

@talldan talldan merged commit 6888113 into trunk Dec 16, 2022
@talldan talldan deleted the add/flaky-tests-comment branch December 16, 2022 07:33
@github-actions github-actions bot added this to the Gutenberg 14.9 milestone Dec 16, 2022
@Mamaduka (Member)

Nice work, @kevin940726!

By the way, I'm trying to fix the font size picker related flaky tests in #46591.

@dmsnell (Member) commented Dec 16, 2022

This has already introduced an amplification of noise in my GitHub notifications. Given how flakey our tests are it seems like this risks making the flakiness worse, because now not only do those tests not pass, but they also leave comments we have to ignore or delete, or deal with every time we commit or rebase a branch. That is, not only do we have to re-run the tests but we also have to look past these comments, and spend the time navigating to the PR to realize that they were only a distraction.

Would love it if instead of adding more noise we could either update an existing comment or simply mark the flakiness somewhere where it only interferes with the people working on the test-flakiness problem. Even with one comment per PR we're still talking about adding unnecessary comments and notifications on a large percentage of the contributions.

I do understand the desire to tell people who aren't familiar with how unreliable our tests are that they shouldn't be bothered by the fact that the tests failed, but if we end up doing that by leaving twenty comments on their work we might just make the problem worse 😉

@dmsnell (Member) commented Dec 16, 2022

I received yet another notification for these comments while I was writing the above reply. It's worth observing that we don't know from the GitHub notification screen what is demanding our attention until we click through and navigate to the PR and (in some cases scroll down the page and) find the comment and realize it was reiterating that our test suite has failed us. That's frustrating.

@youknowriad (Contributor)

I've been thinking about this a bit. The main reason we have flaky tests and retries is to prevent the job from failing so that nobody has to care about these tests. If the goal is to actually make people care about them and do something about them, I think we should probably just go back to what we had before:

  • Remove retries and flakiness reports, which forces people to either fix or skip the flaky tests.

@talldan (Contributor) commented Dec 19, 2022

Feels to me like we tried that for years, and it didn't work. There were still lots of flaky tests.

I think there were more people interested in fixing them, but only because it was such a bad situation and a terrible developer experience.

@kevin940726 (Member Author)

If the goal is to actually make people care about these tests and do something about them

Not entirely true. The goal is to make these failures more visible, as opposed to hiding them in each issue, while also unblocking the contributors. So flaky tests won't block the PR; they'll just result in a comment on the commit.

This is getting annoying, however, because we currently have too many flaky tests. I don't expect it to post this often once our tests are more stable. I totally agree that the notification is a bit too much though. It'd be nice if we could prevent the comment from sending notifications to the commit authors, but that doesn't seem possible on GitHub.

Would only posting/updating a single comment on PRs as suggested above sound like a good enough middle ground? In the meantime, we can continue trying to fix as many highly frequent flaky tests as possible.
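
For context, the single-comment approach would roughly amount to upserting one bot comment per PR, along these lines (a sketch using @actions/github with a hypothetical marker string; not the actual action code):

const { getOctokit, context } = require( '@actions/github' );

const MARKER = '<!-- flaky-tests-report -->'; // Hypothetical marker to find our own comment.

async function upsertFlakyTestsComment( token, prNumber, body ) {
	const octokit = getOctokit( token );
	const { owner, repo } = context.repo;

	// Look for an existing report comment on the PR (first page only, for brevity).
	const { data: comments } = await octokit.rest.issues.listComments( {
		owner,
		repo,
		issue_number: prNumber,
	} );
	const existing = comments.find( ( comment ) =>
		comment.body.includes( MARKER )
	);

	if ( existing ) {
		// Update it in place so the PR only ever has one report comment.
		await octokit.rest.issues.updateComment( {
			owner,
			repo,
			comment_id: existing.id,
			body: `${ MARKER }\n${ body }`,
		} );
	} else {
		await octokit.rest.issues.createComment( {
			owner,
			repo,
			issue_number: prNumber,
			body: `${ MARKER }\n${ body }`,
		} );
	}
}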

@youknowriad (Contributor)

Feels to me like we tried that for years, and it didn't work. There were still lots of flaky tests.

Sure, but so far, it doesn't feel like the new situation is any better. At least, that's my impression

@dmsnell (Member) commented Dec 19, 2022

Would only posting/updating a single comment on PRs #45798 (comment) sound like a good enough middle ground? In the meantime, we can continue trying to fix as many highly frequent flaky tests as possible.

What's the benefit of the comment? Maybe a clean revert would be most prudent to avoid leaving in parts that don't get updated.

If anything we could add a comment at the top of the project README telling people not to worry about our poor tests. We already know which ones are flakey (at least hypothetically) so why not make those tests optional for a merge? That is, if the test suite fails, show the results, but don't block the PR?
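
For reference, GitHub Actions can express this "show the results but don't block" idea at the job level with continue-on-error, e.g. (a sketch, not an existing workflow change):

  e2e-playwright:
    runs-on: ubuntu-latest
    # The job still runs and its results stay visible in the run logs,
    # but a failure doesn't mark the workflow (or a required check) as failed.
    continue-on-error: true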

The script I wrote for collecting runtime measurements of our Workflows could be easily modified to rerun failed test suites. If we export the list of individual test cases that fail too it could watch for new flakiness. We could even theoretically call out to that server and ask, "given these test results, are there failures that aren't known to be flakey right now?" and use the response to determine whether to clear or reject the PR.

Any option seems better than daily repeatedly reminding every one of us that our tests are unreliable 🙃

@kevin940726 (Member Author) commented Dec 20, 2022

What's the benefit of the comment?

The motivation is mentioned in the "Why?" section in the PR description. The goal is to make flaky tests more visible during commits/reviews so that we get a chance to fix them. I agree though that the pings seem a bit too much at this stage. We can definitely revert this but I still think we need another way to solve this problem or we will just go back to introducing more flaky tests to the project.

We already know which ones are flakey (at least hypothetically) so why not make those tests optional for a merge? That is, if the test suite fails, show the results, but don't block the PR?

The problem actually lies in the hypothetical part. We don't know if a test is flaky until we successfully retry the failed test. A less flaky one could pass a hundred times but fail the 101st time. We don't know if any given PR solves the flakiness without monitoring it for some time either. As for within a single run, we are already retrying failed tests, and if they pass, we don't block the PR.

The script I wrote for collecting runtime measurements of our Workflows could be easily modified to rerun failed test suites.

This works but rerunning the whole workflow is time-consuming and costly. That's why I went for rerunning only the failed test cases.

If we export the list of individual test cases that fail too it could watch for new flakiness.

I don't think I understand this part. Do we commit the list to the repo? Or store them elsewhere? One thing to consider is that PRs can be branched from any given point of time and a static list stored centrally could be easily out of sync.

Any option seems better than daily repeatedly reminding every one of us that our tests are unreliable 🙃

FWIW, this system won't post comments if there are no flaky tests in the run. I don't expect this to be a daily reminder if we can improve the stability of our tests. (But still, the notification is kind of annoying 😅)

@dmsnell (Member) commented Dec 20, 2022

A less flaky one could pass a hundred times but fail the 101st time.

In my experience, which could be wrong based on my perception, we're not talking about 1 in 101 test runs that fails. It's more like 1 in 2 test runs, or 3 in 5.

This works but rerunning the whole workflow is time-consuming and costly. That's why I went for rerunning only the failed test cases.

Here I think we're talking about the same thing, but re-running the tests isn't really that time-consuming or costly from a scripting point of view. My script as-written monitors tests and every five minutes tries to rerun them. This is how I've been able to get hundreds of runs for a PR without even trying.

We could modify my "without optimizations branch" approach and create what is essentially an empty branch that tracks trunk and watches test failures.

I don't think I understand this part. Do we commit the list to the repo?

We have Workflows and test suites and individual test cases within those suites. We could potentially write out the results of each test case into an artifact (probably with some JSON export for Jest) and then look at which individual test cases failed. There's a URL at which we can grab each artifact from the test suites.
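
As a rough illustration of that idea, per-run artifacts can be listed and fetched through the Actions REST API, e.g. (the artifact name here is hypothetical; this isn't part of the PR):

const { getOctokit } = require( '@actions/github' );

// List the artifacts produced by a given workflow run and return a download
// URL for a hypothetical per-test-case results artifact.
async function getTestResultsArtifactUrl( token, owner, repo, runId ) {
	const octokit = getOctokit( token );

	const { data } = await octokit.rest.actions.listWorkflowRunArtifacts( {
		owner,
		repo,
		run_id: runId,
	} );

	const artifact = data.artifacts.find(
		( { name } ) => name === 'test-results' // Hypothetical artifact name.
	);

	if ( ! artifact ) {
		return null;
	}

	// Resolves to a URL for a zip of the artifact's contents.
	const download = await octokit.rest.actions.downloadArtifact( {
		owner,
		repo,
		artifact_id: artifact.id,
		archive_format: 'zip',
	} );

	return download.url;
}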

This has actually been on my plans for the performance tests CI job except it took a lower priority once it was no longer the slowest CI workflow.

I don't expect this to be a daily reminder if we can improve the stability of our tests

In practice I'm worried this will always be the case. I've had maybe ten comments today through rebases while working on trying to understand why all performance tests suddenly stopped working. It's not the biggest problem, but it's the kind of frustration that feels incredibly annoying. It's possibly more prominent for me because I'm trying to fix the root problems behind these issues.

@kevin940726 (Member Author)

@dmsnell I'm failing to understand some details here, and I think maybe we're both missing some context. Let me explain how the flaky tests system currently works so that we can be sure we're on the same page. 🙇

  1. E2E tests run on CI (both Puppeteer and Playwright) will auto-retry at most 2 times (3 attempts in total) until they pass (see the configuration sketch after this list).
  2. Passed tests that have been retried before are marked as "flaky tests". They don't block the CI.
  3. After all tests are finished, we gather all the flaky tests into an artifact.
  4. On a different job, we download the artifact and report each flaky test to its corresponding tracking issue automatically.
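
Step 1 corresponds to retry configuration along these lines (a simplified sketch, not the exact Gutenberg config):

// playwright.config.js (sketch): retry failed tests up to 2 times on CI.
module.exports = {
	retries: process.env.CI ? 2 : 0,
};

// Jest/Puppeteer equivalent (sketch), e.g. in a setup file:
// jest.retryTimes( 2 );
//
// A custom reporter can then treat tests that eventually passed but have a
// retry count greater than 0 as "flaky" and write them out for step 3.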

All of the above already worked before this PR. The motivation for retrying failed tests is to help unblock the contributors so that they don't have to examine those flaky ones individually and manually rerun the workflow. However, this comes at the cost of hiding the flaky tests in each issue, rendering them less visible, and hence making our project flakier, as PR authors won't catch them.

What this PR does is add a fifth step that aggregates those flaky tests into a comment. We try to make it clear in the comment that the flaky tests probably aren't related to the PR itself, but the information is still there for examination.
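
Mechanically, that fifth step boils down to posting a commit comment through the REST API, roughly like this (a simplified sketch, not the actual action code):

const { getOctokit, context } = require( '@actions/github' );

// Post the aggregated flaky tests summary as a comment on the commit.
async function reportFlakyTestsToCommit( token, summaryMarkdown ) {
	const octokit = getOctokit( token );

	await octokit.rest.repos.createCommitComment( {
		...context.repo,
		commit_sha: context.sha,
		body: summaryMarkdown,
	} );
}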

The problem now is that the notification is too noisy. Possibly creating only a single comment on the PR (but not the commit) is a better alternative.

Hope that this explains the situation a little bit better. Any other solution is always welcome though. For instance, I've been wanting to build an external dashboard to host all these flaky test data elsewhere so that we can do advanced data visualization if possible.

but re-running the tests isn't really that time-consuming or costly from a scripting point of view

I believe it's still a cost to our CI runner, as we don't have unlimited parallel run durations on GitHub.

In practice I'm worried this will always be the case.

Might be, but I still believe we can improve our tests' stability so much more. For instance, the recent Playwright migration tends to produce more stable tests than the Puppeteer suite (just a feeling, there's no proof yet 😅).

Could you explain further in steps what you have in mind about the "scripting" approach and how it would work?

@dmsnell (Member) commented Dec 20, 2022

I think maybe we're both missing some context.

probably! just to affirm though, thanks again for working on the testing suites. it's a terrible frustration with the project and I'm glad people are trying to make it better

What this PR does is add a fifth step to aggregate those flaky tests into a comment. We try to make it clear in the comment to note that the flaky tests probably aren't related to the PR itself, but the information is still there for examination.

For this PR maybe this is my biggest gripe. It feels like we're punishing the wrong people, and making noise where it isn't needed. E.g. if the tests are failing in my PR I'm already going to look at those tests and see if they seem related to my changes or not. As anyone who works in the repo should learn rather quickly, if a test fails, it's most likely caused by broken tests and not by broken code.

A simple note in the README or on first contributors' PRs might be a way to better communicate this: it's not you, it's our infrastructure.

I've been wanting to build an external dashboard to host all these flaky test data

This seems like a better direction for this problem IMO than nagging people trying to contribute to the project. This is a framework-level problem, or an education problem, or both. I'd rather have a central place to review and work on fixing the broken tests (though I thought that was the point of the issues already, and if that's correct, then it's another reminder that our process is the problem since we haven't prioritized fixing the tests we know are unreliable).

the recent Playwright migration tends to have more stable tests than the Puppeteer one

I have a growing suspicion that what we saw early on was more due to the fact that there were fewer tests written in Playwright, and also that we're doing a lot more fixturing and mocking. This is a tradeoff: tests that won't fail as often, but that in exchange give up testing the behaviors they are there to assert. as we continue to fill out those Playwright tests, let's see how the reliability holds. as we see more and more tests succeed when the app fails, because of that fixturing, we might end up unravelling that faux reliability.

Could you explain further in steps what you have in mind about the "scripting" approach and how it would work?

Oh it's probably not any different than what's already there. The idea was more like what you were talking about with a dashboard and one that would reach out to re-run the failing flakey tests more often. I don't know if 2 or 3 times is enough to get past the flakiness typically, and I don't know - does our existing watcher track things around the repo or just on individual PRs?

I am of the understanding that our flakey test watcher is not monitoring the performance tests, but when something happens as it did yesterday and suddenly all Performance Tests workflow runs fail, it would be nice to alert that.

this comes with the cost of hiding the flaky tests in each issue, rendering them less visible, hence making our project flakier as PR authors won't catch that.

concluding, and bringing this back to the first point, I don't follow this argument. we are not likely to catch the introduction of flakey tests in the PR that introduces them. this is because when they are introduced we can't know yet if they are flakey; it's likely that during development of a branch some tests will fail and then be resolved due to actual code problems and not due to infrastructure problems.

however, it's only after merging those tests into the mainline that we will learn they are flakey, and at that point it's too late to catch. they aren't hidden in PRs by letting those PRs pass; they just aren't obstructing developers for the problems someone else (or their past self) introduced.

if we have known flakey tests in tracking issues already, why do we think that telling random PR authors is going to help? maybe it cools their anxiety, but only until they learn they should assume failed tests are just that - failed tests.

@kevin940726 (Member Author)

Thanks! I appreciate your feedback a lot too!

if the tests are failing in my PR I'm already going to look at those tests and see if they seem related to my changes or not. As anyone who works in the repo should learn rather quickly, if a test fails, it's most likely caused by broken tests and not by broken code.

That's not what I've experienced though. People are often confused by the broken tests and have no clue whether they're related to their PR or not; I've been asked a few times already. Seeing the big red cross on one's PR is somewhat discouraging, whether consciously or not. It also prevents the PR from being merged, so a maintainer has to jump in, double-check that the failure is unrelated to the PR, and re-run the job until it passes.

A simple note in the README or on first contributors' PRs might be a way to better communicate this: it's not you, it's our infrastructure.

The comment is the note IMO. Instead of claiming upfront that our tests are not stable, and risking people not trusting our tests or even avoiding contributing to them, we only leave a note when flaky tests are actually caught.

I'd rather have a central place to review to see and work on fixing the broken tests (though I thought that was the point of the issues already, and if that's correct, then it's another reminder that our process is the problem since we haven't prioritizing fixing the tests we know are unreliable).

We do have the issues list as a central place to view all the flaky tests. They were reported silently, so we had to routinely go to the list to look for something to fix. This PR makes them more visible during PRs, so hopefully we can fix them faster. The Playwright migration is also targeting the flakiest ones to migrate first, so we are prioritizing it; it's just that there aren't many people working on it right now 😅.

I have a growing suspicion that what we saw early on was more due to the fact that there were fewer tests written in Puppeteer and also that we're doing a lot more fixturing and mocking.

We already have 355 test cases in Playwright, compared to 449 test cases in Puppeteer. A quick (unverified) search shows that, out of 216 total reported flaky tests, 174 are currently written with Puppeteer. Among the 39 open ones, 29 are in Puppeteer as well. This doesn't prove much though, as I didn't take flakiness into account.

We don't do fixturing or mocking on anything that is the focus of the test. We usually only use them for clearing/resetting state, which IMO is best practice. For instance, we don't manually go to the posts page to delete posts between each test when we can test that flow once and call the API elsewhere to do the same thing in a much faster and more reliable way. We had a discussion about this a while ago and we might bring it up again in a more formal manner via an RFC PR.

I don't know if 2 or 3 times is enough to get past the flakiness typically, and I don't know - does our existing watcher track things around the repo or just on individual PRs?

If it fails too often (more than 2 times in a row) then it's probably a sign that the test is too flaky to be considered valuable, and we should fix it or skip it ASAP. We track it whenever the e2e test job is run, that is, any commit to trunk/release/wp and PR.

I am of the understanding that our flakey test watcher is not monitoring the performance tests, but when something happens as it did yesterday and suddenly all Performance Tests workflow runs fail, it would be nice to alert that.

I'm not sure if it makes sense to monitor the performance test though. If something breaks the performance test then we should just try fixing it instead of letting it pass with warnings. That's what the check is for in the first place, isn't it?

however, it's only after merging those tests into the mainline that we will learn they are flakey, and at that point it's too late to catch.
if we have known flakey tests in tracking issues already why do we think that telling random PR authors is going to help.

I agree, this is a valid concern. However, without a dashboard, I think this is the best option that I can think of for now 😞.

I should clarify though that the comment isn't actually for PR authors, but more for maintainers or experienced contributors who also happen to be PR authors. They are encouraged to review the flaky tests and keep monitoring them if needed. It should be a goal for us to keep improving the stability of our test cases and the comment is just a tool to help make it more visible.


I opened #46785 to make it only post the comment on the PR but not on commits. LMK if that's better and we can merge and iterate from there.

@dmsnell (Member) commented Jan 4, 2023

It also forbids the PR from being merged, so a maintainer has to jump in, double-check that it's unrelated to the PR, and re-run the job until it passes…Instead of claiming upfront that our tests are not stable and risking people not trusting our tests or even avoiding contributing to them, we only leave a note when there are flaky tests being caught.

This makes sense, but I think the problem is fairly widespread and unrelated to the changes in any given PR. I don't have numbers on how many PRs experience flakey tests, but I'm curious if it's normal to have a PR that doesn't. If every PR that I propose gets this alarm then I don't understand how having the comment builds any more trust than a notice up-front.

I'm reminded that many open-source projects ship with partially-failing test suites. The reality is that our tests don't warrant trust.

This PR makes them more visible during PRs, so hopefully we can fix them faster.

This is where I keep asking who the target audience is. On one hand we're talking about alerting feature developers that they aren't the reason the tests failed, but on the other we're talking about making flakey tests more visible to the infrastructure folks.

How are we going to test if this leads to faster response? How does this make these tests more visible by scattering them across individual PRs given that they are already collected in one place? If we haven't seen people go to the issues list and pick up flakey tests, what leads us to believe they will scan through random PRs and look for a comment that may or may not be there, which will potentially show them tests they already came by?

If something breaks the performance test then we should just try fixing it instead of letting it pass with warnings.

That's the same kind of wishful thinking that I think is the reason that education/lists/comments won't help with our flakey tests. We are fully aware that we have flakey tests but we haven't prioritized fixing them. If perf tests have issues they will have to go through the same prioritization. Right now we have flakey tests in the perf test suite, we just don't monitor them. "Just fixing it" hasn't been working.

That's what the check is for in the first place, isn't it?

Not to my understanding. It's true that if the perf tests fail then the PR is rejected. I have suggested we lift this because the point of those tests is to monitor the performance of the editor. The other E2E suites are there for asserting correct behaviors.

If a perf test fails because of a flakey test it doesn't mean anything other than that the test suite is broken. It's more likely, I suspect, that if an E2E test fails it's because of a real failure (compared to the perf tests), since those tests are intended to track potentially risky flows, whereas the perf tests kind of assume things work and never expect to fail because of a real flaw.


At a minimum, what are your plans for measuring the impact of these changes? How have you proposed we know whether this alert is achieving its goal(s)?

@kevin940726 (Member Author)

I don't have numbers on how many PRs experience flakey tests, but I'm curious if it's normal to have a PR that doesn't.

The goal is to minimize the number of flaky tests in the project. After the refactoring, Playwright tests tend to have fewer flaky tests, oftentimes zero, which is proof that it is possible.

Tests have less value if they are flaky; we don't know whether a test is flaky because it's poorly written or because the functionality itself is sometimes broken.

Note that keeping the tests stable is also an encouragement for contributors to write more tests, which eventually makes our project more stable. If we don't trust our tests, then nobody will, and we'd be better off not writing any tests at all.

This is where I keep asking who the target audience is. On one hand we're talking about alerting feature developers that they aren't the reason the tests failed, but on the other we're talking about making flakey tests more visible to the infrastructure folks.

This is unfortunately true, but I don't think we have better options. The comment will notify the reviewers, who are often maintainers of the project and should be actively monitoring the overall health of our tests.

If we haven't seen people go to the issues list and pick up flakey tests, what leads us to believe they will scan through random PRs and look for a comment that may or may not be there, which will potentially show them tests they already came by?

We are fully aware that we have flakey tests but we haven't prioritized fixing them.

I have been actively working on fixing many flaky tests, along with many other contributors. Fixing flaky tests is always a priority, just not that high compared to other important features. This is a maintenance job that we just have to keep doing. Perhaps making the report comment more visible will draw more people to help on this too.

Right now we have flakey tests in the perf test suite, we just don't monitor them. "Just fixing it" hasn't been working.

I'm not familiar with the perf test, but I'm sure there are folks actively working on maintaining them, aren't there?

At a minimum what are your plans for measuring the impact of these changes? How have you prosed we know if this alert is achieving its goal(s)?

This system so far has been helpful for me and other folks to prioritize the flaky tests that we want to fix. I'd say that it's achieving its goal already. Such things might be difficult to measure, but as someone actively working on writing/fixing/migrating/refactoring/reviewing e2e tests, I'd say this is worth the effort.

Of course, if anyone feels strongly that this is still too annoying for them, then we can always revert it as you suggested.
