General Protection Fault in GitHub Actions container in balin #1273

Open
cbrxyz opened this issue Sep 7, 2024 · 2 comments
Assignees
Labels
infra Related to MIL server infrastructure infra-leads-notify software

Comments

@cbrxyz
Member

cbrxyz commented Sep 7, 2024

What needs to change?

When running catkin_make -j10 (10 is the number of processes the Docker container is limited to), the Python executable is killed roughly 20 seconds into the build, at around 15%-20% completion. This occurs within a Docker container that has been moved between computers.

Some details:

  • The dmesg output shows the following trap:

    [12151.484884] traps: python3[2025829] general protection fault ip:56b5d5 sp:7ffd85f2c0a0 error:0 in python3.8[423000+294000]
    
  • I suspect that the python3 process in the error log corresponds to catkin_make, as catkin_make runs using Python. Other Python processes are running as well, but they appear to exit cleanly.

  • The catkin_make output indicates that it is being terminated unexpectedly.
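
To make trap lines like the one above easier to compare across runs, here is a minimal sketch of a parser for them. The regex is an assumption based only on the single line quoted in this issue, and `parse_trap` is a hypothetical helper name. The offset it reports is relative to the start of the mapping printed in brackets, which may not match the offset within the python3.8 file itself.

```python
import re

# Pattern modelled on the one "traps:" line quoted in this issue (an assumption;
# other kernels or fault types may format the line differently).
TRAP_RE = re.compile(
    r"traps: (?P<comm>\S+)\[(?P<pid>\d+)\] (?P<kind>.+?) "
    r"ip:(?P<ip>[0-9a-f]+) sp:(?P<sp>[0-9a-f]+) error:(?P<error>[0-9a-f]+)"
    r"(?: in (?P<object>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\])?"
)


def parse_trap(line):
    """Return a dict describing a dmesg trap line, or None if it does not match."""
    match = TRAP_RE.search(line)
    if match is None:
        return None
    info = match.groupdict()
    info["pid"] = int(info["pid"])
    # Addresses are printed in hex without a 0x prefix.
    for key in ("ip", "sp", "error"):
        info[key] = int(info[key], 16)
    if info["base"] is not None:
        info["base"] = int(info["base"], 16)
        info["size"] = int(info["size"], 16)
        # Offset of the faulting instruction from the start of the mapping.
        info["offset"] = info["ip"] - info["base"]
    return info


if __name__ == "__main__":
    line = (
        "[12151.484884] traps: python3[2025829] general protection fault "
        "ip:56b5d5 sp:7ffd85f2c0a0 error:0 in python3.8[423000+294000]"
    )
    info = parse_trap(line)
    print(info["comm"], info["pid"], info["kind"], hex(info["offset"]))
```

If repeated crashes report the same offset, that would hint at one specific code path; offsets that differ from crash to crash would be more consistent with the hardware theory.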

Possible Causes:

  • This error is happening inside a Docker container, which should be reusable and composable, so the container itself is unlikely to be the cause. The error could be hardware-related, as suggested by a similar Proxmox forum post. The container has been moved between computers multiple times, and this is the first occurrence of such errors.

  • Additionally, pip on this computer often downloads files that fail their CRC checks. The first attempt often fails, though it usually succeeds after a couple of tries. This could be another indicator of a deeper issue.
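
One cheap way to probe the corruption theory from userspace is to hash the same in-memory buffer many times and count disagreements. This is only a sketch: `count_hash_mismatches` and its default sizes are hypothetical, and a clean run proves nothing, since a dedicated memory tester exercises far more of the RAM than this does.

```python
import hashlib
import os


def count_hash_mismatches(size_bytes=64 * 1024 * 1024, rounds=20):
    """Hash one random buffer `rounds` times and count digests that differ.

    On healthy hardware the same bytes always hash to the same digest, so
    any mismatch suggests data is being corrupted in memory or in the CPU.
    """
    data = os.urandom(size_bytes)
    reference = hashlib.sha256(data).hexdigest()
    mismatches = 0
    # The first round produced the reference digest; repeat for the rest.
    for _ in range(rounds - 1):
        if hashlib.sha256(data).hexdigest() != reference:
            mismatches += 1
    return mismatches


if __name__ == "__main__":
    bad = count_hash_mismatches()
    print(f"{bad} mismatching digests")
```

Any nonzero count on this machine, compared with zero on another host running the same container, would support the hardware explanation.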

How would this task be tested?

  1. Ensure that CI runs to completion without the build being killed.
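
Since the failure is intermittent, a single green CI run may not be convincing. Below is a rough sketch of a loop that repeats a command and reports whether each run exited cleanly, failed, or was killed by a signal. `describe_exit` and `run_repeatedly` are hypothetical names, and the catkin_make invocation in the usage section is only the command from this issue, to be run from the workspace inside the container. Note that this only sees the exit status of the top-level process; if a child python3 process is the one that faults, the build may simply report a nonzero exit code.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a subprocess return code into a short human-readable status."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # On POSIX, subprocess reports death by signal N as return code -N.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        return f"killed by {name}"
    return f"failed with exit code {returncode}"


def run_repeatedly(cmd, attempts=3):
    """Run `cmd` several times and collect one status string per attempt."""
    results = []
    for _ in range(attempts):
        proc = subprocess.run(cmd)
        results.append(describe_exit(proc.returncode))
    return results


if __name__ == "__main__":
    # Defaults to the build command from this issue; pass another command to override.
    cmd = sys.argv[1:] or ["catkin_make", "-j10"]
    for attempt, status in enumerate(run_repeatedly(cmd), start=1):
        print(f"attempt {attempt}: {status}")
```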
@cbrxyz cbrxyz added software infra Related to MIL server infrastructure infra-leads-notify labels Sep 7, 2024
@cbrxyz cbrxyz assigned cbrxyz and DaniParr and unassigned andrew-aj Jan 11, 2025
@cbrxyz
Member Author

cbrxyz commented Jan 11, 2025

Daniel and I will finish triaging the issues on the server by February 1st, to support the move of our data off of the ECE servers.

@cbrxyz
Member Author

cbrxyz commented Jan 17, 2025

We ran memtest today, and it reported a RAM error on the 17th pass. We're going to try replacing the RAM soon to see whether that fixes the fault.


3 participants