
CT hangs up when checking for stale repos #233

Closed · zepumph opened this issue Feb 5, 2025 · 9 comments

zepumph commented Feb 5, 2025

This has happened 3 times. I'll take a look.

zepumph commented Feb 5, 2025

I can't be confident about why this problem is happening, or why it only started recently, but I added a timeout to the isStale checks. My guess is that this won't totally solve the problem, and that something elsewhere is causing a memory leak or hangup; we only notice it because the snapshot loop sets the status more often. No matter, my fingers are crossed.
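For reference, here is a minimal sketch of the timeout idea, assuming a promise-based isStale check; the helper name and the 30-second value are hypothetical, not the actual CT code:

```js
// Race a promise against a timer, resolving to a fallback value if it hangs.
// Hypothetical helper: the names and the 30-second value are illustrative only.
function withTimeout( promise, ms, fallbackValue ) {
  let timeoutID;
  const timer = new Promise( resolve => {
    timeoutID = setTimeout( () => resolve( fallbackValue ), ms );
  } );
  return Promise.race( [ promise, timer ] ).finally( () => clearTimeout( timeoutID ) );
}

// usage, inside an async function: treat a hung check as "not stale" after 30s
const stale = await withTimeout( isStale( repo ), 30000, false );
```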

I asked @jonathanolson on Slack whether the perennial npm run "scenerystack-test" process could be causing trouble, since it crashes on every row during a rollup subprocess (see #226), but I'm not at all confident about the relation.

I just restarted CT with the above changes, and I'll likely close this issue before the multi-week period it may take to reproduce this problem. We can always reopen if things aren't fixed.

zepumph commented Feb 6, 2025

So far so good!

zepumph closed this as completed Feb 6, 2025
zepumph commented Feb 17, 2025

@pixelzoom reported this hanging again. Reopening.

zepumph commented Feb 17, 2025

After I run pm2 stop all on phet@sparky, here is the output of top:

USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
phet      20   0   30.3g 987372  58484 S  10.5   0.4 472:24.36 node
phet      20   0   29.9g 664596  86508 S  10.5   0.3  84:28.95 node
phet      20   0   30.0g 689580  49132 S  10.2   0.3   9819:19 node
phet      20   0   30.1g 718268  49192 S  10.2   0.3  10035:41 node
phet      20   0   30.0g 668140  49120 S  10.2   0.3   9124:52 node
phet      20   0   30.0g 659180  48856 S  10.2   0.3  10061:58 node
phet      20   0   30.0g 669436  49008 S  10.2   0.3   9568:43 node
phet      20   0   30.2g 811164  48824 S  10.2   0.3  10039:03 node
phet      20   0   30.4g 969976  49400 S  10.2   0.4   9667:12 node
phet      20   0   30.0g 677236  49064 S  10.2   0.3  10010:04 node
phet      20   0   29.9g 651516  48988 S  10.2   0.2   9575:21 node
phet      20   0   30.0g 671316  49028 S  10.2   0.3   9537:05 node
phet      20   0   30.0g 702076  49072 S  10.2   0.3   9677:52 node
phet      20   0   30.0g 687540  49008 S  10.2   0.3  10031:01 node
phet      20   0   30.0g 804748  49004 S  10.2   0.3   9876:39 node
phet      20   0   30.1g 748184  49180 S  10.2   0.3  10029:53 node
phet      20   0   30.2g 903420  49240 S  10.2   0.3   9715:18 node
phet      20   0   30.3g 895688  48948 S  10.2   0.3  10042:07 node
phet      20   0   30.0g 699872  49136 S  10.2   0.3   9818:10 node
phet      20   0   30.2g 887096  49124 S  10.2   0.3  10161:38 node
phet      20   0   30.0g 672292  48988 S  10.2   0.3  10049:26 node
phet      20   0   30.0g 709100  48952 S  10.2   0.3   9486:07 node
phet      20   0   30.2g 847772  48980 S  10.2   0.3  10056:48 node
phet      20   0   30.3g 915792  67656 S  10.2   0.3 392:55.62 node
phet      20   0   30.0g 671868  48808 S  10.2   0.3   9824:44 node
phet      20   0   30.0g 678200  48228 S  10.2   0.3   9880:08 node
phet      20   0   30.1g 706472  49172 S  10.2   0.3   9820:39 node
phet      20   0   30.1g 744028  49108 S  10.2   0.3   9950:55 node
phet      20   0   30.1g 730440  48784 S  10.2   0.3   9471:11 node
phet      20   0   30.3g 896056  49396 S  10.2   0.3   9480:55 node
phet      20   0   30.0g 901796  45468 S  10.2   0.3  10295:58 node
phet      20   0   30.0g 653740  49044 S  10.2   0.2   9273:59 node
phet      20   0   30.3g 907472  49076 S  10.2   0.3   9659:55 node

This looks like a leak of node subprocesses.
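For what it's worth, the usual guard against dangling children in Node's child_process API is a kill timeout plus cleanup on parent exit; a sketch (the task name is hypothetical):

```js
// execFile's timeout option kills a hung child (SIGTERM by default)
// instead of letting it linger; 'some-task.js' is a hypothetical task.
const { execFile } = require( 'child_process' );

const child = execFile( 'node', [ 'some-task.js' ], { timeout: 60000 }, error => {
  if ( error ) { console.error( error.message ); }
} );

// also reap the child if the parent exits first
process.on( 'exit', () => child.kill() );
```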

zepumph commented Feb 17, 2025

[phet@sparky ~]$ pgrep --count node
111

zepumph commented Feb 19, 2025

Also mentioning nodejs/node#35973, because serializing our report as a single JSON string is messing things up.
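That issue boils down to V8's string length cap (roughly 2 ** 30 characters on 64-bit builds), which is easy to reproduce:

```js
// Building the whole report as one giant JSON string can exceed V8's cap
// and throw the same error as in nodejs/node#35973:
try {
  'x'.repeat( 2 ** 30 ); // just over V8's 64-bit string length limit
}
catch( e ) {
  console.log( e.name, e.message ); // RangeError Invalid string length
}
```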

zepumph added a commit that referenced this issue Feb 19, 2025
zepumph added a commit that referenced this issue Feb 20, 2025
zepumph added a commit that referenced this issue Feb 20, 2025
zepumph added a commit that referenced this issue Feb 20, 2025
zepumph added a commit that referenced this issue Feb 20, 2025
zepumph commented Feb 20, 2025

Tagging phetsims/perennial#436

CT is in a much better spot after the above fixes. Here is a summary:

Problems:

  1. The perennial scenerystack-test npm-run test failed on every column, in a way that left dangling node processes.
  2. The same scenerystack-test run also produced a large amount of console log output on every column.
  3. perennial type-check --all failed with a GIANT amount of output (450,000 lines), which broke the JSON.stringify of the report.

Solutions

  1. @jonathanolson removed parcel builds from scenerystack-test, so it doesn't fail anymore (phetsims/perennial@654b908).
  2. In that commit, scenerystack-test also got less logging output.
  3. In the above commits, I truncated the message to 150,000 chars so that one failure can't tank the entire CT reporting system (see nodejs/node#35973, "RangeError: Invalid string length" with large files, for our V8 string length constraint); a sketch follows this list.
  4. Type checking was failing because parallel lint and type-check processes were running tsc clean operations, which can remove d.ts files that other processes depend on.
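A minimal sketch of the truncation from item 3; the helper name is hypothetical, but the 150,000-char cap is the one described above:

```js
// Cap a single test's message so that one enormous failure can't break
// JSON.stringify of the whole CT report. Hypothetical helper name.
const MAX_MESSAGE_LENGTH = 150000;

function truncateMessage( message ) {
  return message.length > MAX_MESSAGE_LENGTH ?
         `${message.slice( 0, MAX_MESSAGE_LENGTH )}\n... (truncated)` :
         message;
}
```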

CT is back now!

I may want to revert 5f23e9f, but I'll wait just a bit first.

zepumph commented Feb 20, 2025

fba531a broke cases where there wasn't a message (for pass results). Fixed now.
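Presumably the fix is a guard along these lines (hypothetical, continuing the truncation sketch above):

```js
// Pass results carry no message, so guard before reading .length
const safeMessage = message === undefined ? '' : truncateMessage( message );
```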

zepumph commented Feb 21, 2025

Alright! All is looking good. I'll reopen if we have more trouble.

zepumph closed this as completed Feb 21, 2025