Build stall when CAS reset #245
Getting quite a lot of this error from workers:
It looks like the CAS was under a lot of load; the scheduler tried to do some lookups, but the CAS was unable to process more streams due to some other bottleneck, which resulted in the queue filling up and then dropping new connections. We can mitigate some of this with: … However, in this case having more than one CAS might be needed in order to deal with excessive loads.
It was running at a load average of 12 and had 300% CPU usage. I was trying to avoid having multiple CAS instances because it becomes a pain to manage. I may have to migrate to one of the S3 style storage solutions and host Turbo-Cache to front it. It should all have been serving out of the memory store though, so it's surprising that it couldn't handle it.
I wonder if it got caught up in a lock somewhere? It would be great if we could get a core dump if this happens again. The only N^2 problem that I am aware of in this project is in the scheduler matching, and since this happened on the CAS I'm certain it's an externally influenced thing. Do you happen to know the disk usage? Tokio's disk commands are not actually async; it uses blocking calls on a "blocking-safe" thread. By default the runtime is limited to a fixed number of these blocking threads, so maybe it ended up with too much backpressure from the disk I/O?
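For what it's worth, here is a minimal sketch (not turbo-cache code; the thread count and file path are only illustrative) of the blocking-thread limit being described:

```rust
// Tokio's async file APIs run on a bounded pool of blocking threads, so heavy
// disk I/O can pile up behind that limit.
use tokio::runtime::Builder;

fn main() -> std::io::Result<()> {
    let rt = Builder::new_multi_thread()
        // Defaults to 512; every in-flight tokio::fs call occupies one slot.
        .max_blocking_threads(1024)
        .enable_all()
        .build()?;

    rt.block_on(async {
        // tokio::fs::read is implemented with spawn_blocking under the hood.
        let _bytes = tokio::fs::read("/tmp/example").await;
    });
    Ok(())
}
```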
It didn't appear to have a particularly high iowait (wa), but that's anecdotal.
I'm actually getting a lot of this even when it's not heavily loaded:
I think we may have introduced a problem recently. I just updated the cluster to the latest main. The only thing in the CAS logs is:
It would be very useful to get some metrics published under Prometheus.
I'm getting this on the worker:
Can you try reverting that commit? It was significant and is the only major change I can think of that might cause this.
Just doing a build with that reverted now.
Now running with that build and it appears a lot happier, although it's only been up for a few seconds, so I'll try and exercise it to check.
I've just hit it with a totally new branch and it's flying along without any issues. The CAS is running at a load average of 18 and rising, and 300% CPU with a 2.8 wa, but there are no stalls or breakages. Looks like there might be an issue with 67b90e2.
Dang, I will revert that commit. I may need to find another way to do what it's trying to do. I tried a few strategies, but this was the one I hated the least.
I spoke too soon, I'm getting stalls again. When it stalls, the scheduler and remoteexec_proxy CPU usage drops to a tiny amount and I get these:
I think before I got these issues I was running a really old version, back at around 259224b.
Hmmm, would it be too much work to bisect it? What hash were you running before attempting to upgrade?
We might be conflating two issues. The failed to upload stuff looks like it's failing a soft assert:
Triggered in: … I could see a couple of ways this might happen, but none of them should be critical. Does this actually fail the build or just put noise in the logs?
The build doesn't fail; it gets a RETRY entry in the Goma logs and stalls everything. It's like it locks up. I'm currently trying with just 8ac4824 reverted to see what I get.
Reverting that change does appear to have fixed it. And I think I have a reason why: previously, when 128 simultaneous builds produced their outputs, they only generated 128 concurrent uploads of results to the CAS. However, now they upload the stdout and stderr too, so they have 384 simultaneous uploads (128 × 3 streams). Combine that with the downloads and you can see how the limit might be reached on the CAS. This is simply a scale thing, I think.
Having said that, there are still four RETRY entries in Goma for this build, so perhaps it's still happening sometimes?
I think that these logs in the CAS indicate that an issue occurred:
Although they don't really explain what that issue was... |
Looking into this, it's super strange; the reader and the writer are both complaining about each other:
This is super confusing. |
Should there just be a retry in there, perhaps?
I guess we could. I'd rather figure out the root cause though, because it might point to a more fundamental issue. I'm still investigating what might be going on... The entire reason I created …
Can you share your worker config file? |
Well, found one bug. I'm pretty sure it is not the root cause of anything, but it's still wrong (…).
That would explain why the log says it's populating the fast store but it's remote... |
I'm pretty sure what is happening is that the GRPC store (or somewhere downstream) is getting the exact number of bytes needed, then closing the stream before the sender is able to send an EOF. In other words, somewhere downstream says: "oh, I got all the data you are going to send, I might as well close the connection", even though there's a zero-byte message that is still required to be sent (the EOF message). If this is the case, I'd bet money that it's on the fence-post (like 4096 bytes), and somewhere there's a short circuit on "exact bytes received".
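To make that theory concrete, here is a hypothetical sketch of the kind of short circuit being described; the channel type and function are assumptions for illustration, not the actual store code:

```rust
use bytes::Bytes;
use tokio::sync::mpsc::Receiver;

// Hypothetical receive loop illustrating the suspected fence-post bug: it
// short-circuits on "exact bytes received" and drops the receiver before the
// sender has a chance to deliver its zero-byte EOF message.
async fn receive_exact(mut rx: Receiver<Bytes>, expected_size: usize) -> Result<Vec<u8>, String> {
    let mut out = Vec::with_capacity(expected_size);
    while let Some(chunk) = rx.recv().await {
        out.extend_from_slice(&chunk);
        if out.len() == expected_size {
            // BUG: returning here drops `rx`, so the sender's follow-up EOF
            // message fails with a "receiver dropped" style error.
            return Ok(out);
        }
    }
    Err("stream closed before all bytes were received".to_string())
}
```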
Another fence post? [edit] I should finish reading your message before commenting :) The stream errors all show the current byte when it fails:
To test a theory, can you change:
For the theory to work …
Yes. But my idea is that there are so many files being uploaded that one of them is bound to be hit.
Unfortunately, I'll have to look at it tomorrow now. I think we're certainly on to something though. It might also explain why the integration tests are sometimes flaky.
Can you also share your CAS config?
Oh, the red herring... This error message is super misleading... All error messages after: … I'm very certain …
Do you still need the CAS config?
Let me see how far I can go, but it'd be nice to have. |
I think I found the problem: it sees that the exact number of bytes has been received, publishes an EOF to the downstream store, then closes the channel without reading the possible EOF from the GRPC client's store. I think to fix this we just need to try to read one last message here: … and if it returns EOF or an error, ignore it. If it returns more data, it means it's an error.
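A hypothetical sketch of that fix, using the same assumed channel shape as the earlier sketch (the real code path will differ):

```rust
use bytes::Bytes;
use tokio::sync::mpsc::Receiver;

// Hypothetical version of the receive loop that drains one extra message so
// the sender can deliver its zero-byte EOF instead of hitting a dropped
// receiver. Anything beyond that EOF is treated as a real error.
async fn receive_exact_then_drain_eof(
    mut rx: Receiver<Bytes>,
    expected_size: usize,
) -> Result<Vec<u8>, String> {
    let mut out = Vec::with_capacity(expected_size);
    while out.len() < expected_size {
        match rx.recv().await {
            Some(chunk) => out.extend_from_slice(&chunk),
            None => return Err("stream closed before all bytes were received".to_string()),
        }
    }
    // Read one last message: a closed channel or an empty (EOF) chunk is fine;
    // more data means the sender sent more than it promised.
    match rx.recv().await {
        None => {}
        Some(chunk) if chunk.is_empty() => {}
        Some(_) => return Err("sender sent more data than expected".to_string()),
    }
    Ok(out)
}
```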
DropCloserWriteHalf.send_eof should check close_after_size before producing an error, right? Or the drop for DropCloserWriteHalf should check before sending the error |
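As a toy illustration of that suggestion (the real DropCloserWriteHalf API almost certainly differs; the fields here are assumptions):

```rust
// Toy write-half showing the proposed check: if the receiver was told to close
// itself after an exact byte count and we have sent exactly that many bytes, a
// dropped receiver at EOF time is a clean close rather than an error.
struct ToyWriteHalf {
    bytes_sent: u64,
    close_after_size: Option<u64>, // hypothetical mirror of the discussed field
    receiver_alive: bool,
}

impl ToyWriteHalf {
    fn send_eof(&mut self) -> Result<(), String> {
        if !self.receiver_alive {
            if self.close_after_size == Some(self.bytes_sent) {
                // Receiver consumed exactly what it asked for; not an error.
                return Ok(());
            }
            return Err("receiver dropped before EOF was sent".to_string());
        }
        // ... otherwise send the zero-byte EOF message over the channel ...
        Ok(())
    }
}
```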
The only way that error can become set is if …
The big problem is that …
I tried adding:
to the unfold in …
It's super weird. I've modified all of the paths to merge errors now, and I don't see any errors from any other paths, so I don't understand how the Receiver is being dropped.
Can I suggest we also add this to avoid closing both ends of the pipe in error cases:
Confirming that I am not seeing this issue with PR #250 applied.
Added your suggestion (but modified).
I had everything stall during a full Chromium build. It did all come back to life after a minute or so, but the cause appears to be this:
I'm not sure why this happened and it's the first time I've seen it.