
CI: fix CI to run consistent tests 100% of the time #33089

Closed
wants to merge 7 commits

Conversation

bongbui321
Contributor

@bongbui321 bongbui321 commented Jul 26, 2024

Recently I've mostly seen the unit_test job failing, and from looking at the failing logs, it seems there is a Fatal Python error: Bus error that crashes the pytest-xdist workers. This error most likely comes from workers trying to access an unavailable memory region, so I just increased Docker's shm-size from 1GB to 2GB.

From 2 runs at this new size (50 runs of unit_test each), I don't see that error happening anymore:
run 1: https://github.com/commaai/openpilot/actions/runs/10115487281/job/27976375147
run 2: https://github.com/commaai/openpilot/actions/runs/10115625540/job/27976846602

The failing tests in these two runs come from a known flaky test in test_loggerd.py; I guess it has some weird interaction with the matrix that causes it to fail (didn't investigate further), but it should be fine with a single run of unit_test.

I did another run to make sure that with the shm-size back at 1GB we see the bus error in unit_test again:

https://github.com/commaai/openpilot/actions/runs/10115830453/job/27977440198

If the issue still continues after this is merged, I'll be happy to finish it.

This is probably due to pytest-xdist running tests that mess around with threading.
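As a side note, here is a minimal sketch of how the shared-memory headroom inside the CI container could be sanity-checked before kicking off the pytest-xdist run. It assumes the bus error really is /dev/shm exhaustion, and the 1.5 GiB threshold and the guard itself are made up for illustration:

import os

def shm_free_bytes(path="/dev/shm"):
    # /dev/shm is the mount whose size Docker's --shm-size flag controls.
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize

if __name__ == "__main__":
    free_gib = shm_free_bytes() / 1024**3
    print(f"/dev/shm free: {free_gib:.2f} GiB")
    # Hypothetical guard: fail fast instead of letting pytest-xdist workers
    # die later with "Fatal Python error: Bus error" when shared memory runs out.
    assert free_gib >= 1.5, "shm too small for the parallel unit tests; bump docker --shm-size"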

resolves #32939

  • serial testing. Works!

Contributor

github-actions bot commented Jul 26, 2024

Thanks for contributing to openpilot! In order for us to review your PR as quickly as possible, check the following:

  • Convert your PR to a draft unless it's ready to review
  • Read the contributing docs
  • Before marking as "ready for review", ensure:
    • the goal is clearly stated in the description
    • all the tests are passing
    • the change is something we merge
  • include a route or your device's dongle ID if relevant

@bongbui321 bongbui321 changed the title CI: fix CI to pass run consistent test 100% of the time CI: fix CI to run consistent test 100% of the time Jul 26, 2024
@bongbui321 bongbui321 changed the title CI: fix CI to run consistent test 100% of the time CI: fix CI to run consistent tests 100% of the time Jul 26, 2024
@bongbui321 bongbui321 marked this pull request as ready for review July 26, 2024 18:44
@bongbui321 bongbui321 marked this pull request as draft July 26, 2024 18:48
@bongbui321
Contributor Author

It wasn't the issue I initially thought it was.


@@ -34,6 +34,7 @@ def test_missing_translation_files(self):
     assert os.path.exists(os.path.join(TRANSLATIONS_DIR, f"{self.file}.ts")), \
       f"{self.name} has no XML translation file, run selfdrive/ui/update_translations.py"
 
+  @pytest.mark.serial
Contributor

Why doesn't translations_dir work to solve this? Is TemporaryDirectory somehow returning the same directory for two workers?

Contributor Author

@bongbui321 bongbui321 Jul 27, 2024


This was removed since I used -n0 for test_translations.py, which makes it run serially. Serial-marked tests are configured to run in a single process (no xdist workers).
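For context, here's a minimal sketch of how a custom serial marker is typically registered and then split into its own single-process pytest run; the exact wiring in this repo's CI may differ:

# conftest.py (sketch; the real openpilot configuration may differ)

def pytest_configure(config):
    # Register the custom marker so pytest --strict-markers accepts it.
    config.addinivalue_line(
        "markers", "serial: tests that must not run under parallel pytest-xdist workers"
    )

# Typical two-phase CI invocation (shown here as comments):
#   pytest -m "not serial" -n auto   # parallel phase, pytest-xdist workers
#   pytest -m serial -n 0            # serial phase, single process, no workers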

@bongbui321
Contributor Author

bongbui321 commented Jul 27, 2024

Current attempt: serialize the tests that seem to fail when run in parallel with pytest-xdist.

pytest-dev/pytest#3216 (comment)
The problem seems to be with Ubuntu GitHub runners and tests that use multithreading or multiprocessing.

Two runs with 50 unit_test runs each (100% pass):

https://github.com/commaai/openpilot/actions/runs/10120588057
https://github.com/commaai/openpilot/actions/runs/10120477620

@bongbui321
Contributor Author

findings:

  • Increasing the shm-size seems to get rid of the Python bus error, but there are still some random test failures in the ones that use multiprocessing
  • After researching for a while, it looks like a lot of people have this issue, and it seems to only happen on Ubuntu CI machines running inside a container
  • It seems nearly impossible to root-cause the problem since it is not easily reproducible; it's too random
  • Serializing tests that involve multiple processes seems to work (see the sketch after this list)
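As an alternative to a fully separate serial invocation, pytest-xdist's xdist_group marker (used together with --dist loadgroup) can pin a set of tests to a single worker. A hedged conftest sketch, not necessarily what this PR ended up doing:

# conftest.py sketch: route every test marked "serial" to the same xdist worker.
# Requires running pytest with --dist loadgroup; shown only as an illustration.
import pytest

def pytest_collection_modifyitems(config, items):
    for item in items:
        if item.get_closest_marker("serial"):
            # Tests sharing one xdist_group name are scheduled on a single worker,
            # so multi-process tests stop racing each other across workers.
            item.add_marker(pytest.mark.xdist_group(name="serial"))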

@bongbui321 bongbui321 force-pushed the fix_ci branch 2 times, most recently from baca33a to c97df97 Compare July 27, 2024 15:36
@bongbui321 bongbui321 marked this pull request as ready for review July 27, 2024 16:06
@adeebshihadeh
Contributor

  • Increasing the shm-size seems to get rid of the Python bus error, but there are still some random test failures in the ones that use multiprocessing

Are the errors now specific to a particular test? If so, that's worth including here.

  • Serializing tests that involve multiple processes seems to work

It still seems like we don't know what's going on here. What if it's this #27512?

@bongbui321
Contributor Author

bongbui321 commented Jul 29, 2024

Are the errors now specific to a particular test? If so, that's worth including here.

Included in comments.

  • test_loggerd.py::test_rotation runs multiple processes
  • test_translations.py has some I/O operations which might be the cause of some of the failing tests; not sure how the workers interact with these I/O ops, there's a high chance it's a pytest-xdist bug (one common mitigation is sketched below)
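One common mitigation for worker-vs-worker I/O clashes is to key any shared path on the worker id that pytest-xdist exposes. A hedged sketch; the fixture name is hypothetical and this is not necessarily what the failing tests need:

import os
import pytest

@pytest.fixture
def worker_scratch_dir(tmp_path):
    # pytest-xdist sets PYTEST_XDIST_WORKER (e.g. "gw0", "gw1") in each worker;
    # it is unset when tests run without -n, so fall back to "main".
    worker = os.environ.get("PYTEST_XDIST_WORKER", "main")
    d = tmp_path / worker
    d.mkdir(exist_ok=True)
    return d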

A lot of people experience these random worker crashes when using pytest-xdist, and there doesn't seem to be a good repro or a clear solution (comment).

It still seems like we don't know what's going on here. What if it's this #27512?

Did a quick run of unit_tests 50 times with deanlee's latest temporary fix and am still seeing these failures (run). This is likely a bug within pytest-xdist.

@adeebshihadeh
Contributor

I looked into this briefly, and it seems like there are multiple things going on and the issues are interacting with each other. When that happens, it's important to carefully fix each one.

  • worker crash -> OOM due to shm size
  • test_rotation failure -> encoderd is crashing

This seems quite reproducible, so we should be able to make progress on it.

I pushed the shm bump here: 76fd5b0

encoderd crash from https://github.com/commaai/openpilot/actions/runs/10174071612/job/28139196053:

>       assert len(diff) == 0, f"didn't get all expected files. run={_} seg={n} {route_path=}, {diff=}\n{logged=} {expected_files=}"
E       AssertionError: didn't get all expected files. run=(1928, 1208, 4804608, 2048, 2490368) seg=0 route_path='/root/.commadd14e5bfc945470/media/0/realdata/00000000--7bf74a8990', diff={'ecamera.hevc', 'fcamera.hevc', 'dcamera.hevc', 'qcamera.ts'}
E         logged={'qlog', 'rlog'} expected_files={'ecamera.hevc', 'qlog', 'fcamera.hevc', 'rlog', 'dcamera.hevc', 'qcamera.ts'}
E       assert 4 == 0
E        +  where 4 = len({'dcamera.hevc', 'ecamera.hevc', 'fcamera.hevc', 'qcamera.ts'})

system/loggerd/tests/test_loggerd.py:185: AssertionError
----------------------------- Captured stdout call -----------------------------
Starting listener for: camerad
system/loggerd/loggerd.cc: logging to /root/.comma81d1a5fdea744a9/media/0/realdata/00000000--e3cf4e4207--0
system/loggerd/loggerd.cc: closing logger
----------------------------- Captured stderr call -----------------------------
encoderd: msgq_repo/msgq/msgq.cc:232: int msgq_msg_send(msgq_msg_t *, msgq_queue_t *): Assertion `3 * total_msg_size <= q->size' failed.
----------------------------- Captured stdout call -----------------------------
Starting listener for: camerad
system/loggerd/loggerd.cc: logging to /root/.comma285a06b46ef64a8/media/0/realdata/00000000--6894b06f08--0
system/loggerd/loggerd.cc: closing logger
----------------------------- Captured stderr call -----------------------------
encoderd: msgq_repo/msgq/msgq.cc:232: int msgq_msg_send(msgq_msg_t *, msgq_queue_t *): Assertion `3 * total_msg_size <= q->size' failed.
----------------------------- Captured stdout call -----------------------------
Starting listener for: camerad
system/loggerd/loggerd.cc: logging to /root/.commadd14e5bfc945470/media/0/realdata/00000000--7bf74a8990--0
system/loggerd/loggerd.cc: closing logger
----------------------------- Captured stderr call -----------------------------
encoderd: msgq_repo/msgq/msgq.cc:232: int msgq_msg_send(msgq_msg_t *, msgq_queue_t *): Assertion `3 * total_msg_size <= q->size' failed.
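For what it's worth, the assertion in the captured stderr is the msgq invariant that a single message may take up at most a third of the queue, so an encoded packet that outgrows the queue makes encoderd abort. A tiny illustrative check of that invariant (the sizes below are made up):

# Illustration only: the real check is the assert in msgq_msg_send() in msgq/msgq.cc.
def msg_fits(total_msg_size: int, queue_size: int) -> bool:
    # Mirrors the asserted invariant: 3 * total_msg_size <= q->size
    return 3 * total_msg_size <= queue_size

assert msg_fits(1 << 20, 3 << 20)        # a 1 MiB packet fits a 3 MiB queue exactly
assert not msg_fits(2 << 20, 3 << 20)    # a 2 MiB packet would trip the encoderd assert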

@bongbui321 bongbui321 closed this Aug 1, 2024