
Build stall when CAS reset #245

Closed
chrisstaite-menlo opened this issue Sep 7, 2023 · 48 comments · Fixed by #250

Comments

@chrisstaite-menlo
Collaborator

Everything stalled during a full Chromium build. It all came back to life after a minute or so, but the cause appears to be this:

[2023-09-07T07:59:41.326Z WARN  h2::proto::streams::recv] recv_reset; remotely-reset pending-accept streams reached limit (20)
[2023-09-07T07:59:41.326Z ERROR cas] Failed running service : hyper::Error(Http2, Error { kind: GoAway(b"", ENHANCE_YOUR_CALM, Library) })

I'm not sure why this happened and it's the first time I've seen it.

@chrisstaite-menlo
Collaborator Author

Getting quite a lot of this error from workers:

[2023-09-07T08:33:14.560Z WARN  simple_scheduler] Internal error for worker 46cd9913-c4bc-41f9-b1d3-cb3c3fe52d95: Error { code: Internal, messages: ["Receiver went away before receiving EOF : Failed to write eof to fast store in fast_slow store update : --- : Writer was dropped before EOF was sent : Failed to receive data in filesystem store : While processing with temp file \"/root/.cache/turbo-cache/data-worker-test/tmp_path-cas/3147cfb3677571c42217a593af517bd73cd66990c8fcd08c9453030000000000-37957\" : Could not upload data to store in upload_file_to_store : for \"/root/.cache/turbo-cache/work/621d69738280b8cb78d47e763a09fc5f8166dfd0567b3523ecaa8659147b735c/by/goma/obj/third_party/blink/renderer/bindings/modules/v8/v8/v8_payment_response.o.d\" : Error while uploading results"] }

@allada
Member

allada commented Sep 7, 2023

It looks like the CAS was under a lot of load; the scheduler tried to do some lookups, but the CAS was unable to process more streams due to some other bottleneck, which resulted in the queue filling up and new connections being dropped.

We can mitigate some of this with:
https://doc.servo.org/h2/server/struct.Builder.html#method.max_pending_accept_reset_streams

However, in this case more than one CAS instance might be needed to deal with excessive load.

hyperium/h2#668
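
For reference, a minimal sketch of what raising that limit looks like against the raw h2 API. This is only illustrative: the function name and the value 100 are arbitrary, and whether this knob is reachable through the hyper/tonic server setup turbo-cache actually uses is an assumption here.

use h2::server;
use tokio::net::TcpStream;

// Sketch only: raises the pending-accept reset-stream limit (the "(20)" hit in
// the log above) so bursts of remotely-reset streams don't trigger
// GOAWAY/ENHANCE_YOUR_CALM as quickly, at the cost of more memory under load.
async fn serve_with_larger_reset_queue(socket: TcpStream) -> Result<(), h2::Error> {
    let mut connection = server::Builder::new()
        .max_pending_accept_reset_streams(100)
        .handshake::<_, bytes::Bytes>(socket)
        .await?;

    while let Some(request) = connection.accept().await {
        let (_request, mut respond) = request?;
        // Real request handling would go here; this sketch just refuses the stream.
        respond.send_reset(h2::Reason::REFUSED_STREAM);
    }
    Ok(())
}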

@chrisstaite-menlo
Collaborator Author

It was running at a load average of 12 and 300% CPU usage. I was trying to avoid having multiple CAS instances because it becomes a pain to manage. I may have to migrate to one of the S3-style storage solutions and run Turbo-Cache in front of it. It should all have been serving out of the memory store though, so it's surprising that it couldn't handle it.

@allada
Member

allada commented Sep 7, 2023

I wonder if it got caught up in a lock somewhere? If this happens again, it would be great to get a core dump.

The only N^2 problem I'm aware of in this project is in the scheduler matching, and since this happened on the CAS I'm certain it's something externally influenced.

Do you happen to know the disk usage? Tokio's disk commands are not actually async; it uses blocking calls on a "blocking-safe" thread. By default the runtime is limited to a fixed number of these blocking threads, so maybe it ended up with too much backpressure from the disk I/O?
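
If that turned out to be the bottleneck, the blocking pool size is tunable when the runtime is built. A minimal sketch, assuming a hand-built runtime (this is not necessarily how turbo-cache constructs its runtime, and 1024 is an arbitrary value):

use tokio::runtime::{Builder, Runtime};

// Tokio services file I/O via spawn_blocking on a bounded pool of blocking
// threads (512 by default), so heavy disk traffic queues behind that limit.
fn build_runtime() -> std::io::Result<Runtime> {
    Builder::new_multi_thread()
        .enable_all()
        // Hypothetical value; tune against observed disk backpressure.
        .max_blocking_threads(1024)
        .build()
}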

@chrisstaite-menlo
Collaborator Author

It didn't appear to have a particularly high wa (I/O wait), but that's anecdotal.

@chrisstaite-menlo
Collaborator Author

chrisstaite-menlo commented Sep 7, 2023

I'm actually getting a lot of this even when it's not heavily loaded:

[2023-09-07T14:06:14.316Z WARN  simple_scheduler] Internal error for worker 46cd9913-c4bc-41f9-b1d3-cb3c3fe52d95: Error { code: Internal, messages: ["Receiver went away before receiving EOF : Failed to write eof to fast store in fast_slow store update : --- : Writer was dropped before EOF was sent : Failed to receive data in filesystem store : While processing with temp file \"/root/.cache/turbo-cache/data-worker-test/tmp_path-cas/06d463ed3e6600423456cdd4faf3c376f042ec06e38b9f002f2f040000000000-10824\" : Could not upload data to store in upload_file_to_store : Error while uploading results"] }
[2023-09-07T14:07:06.437Z WARN  simple_scheduler] Internal error for worker 46cd9913-c4bc-41f9-b1d3-cb3c3fe52d95: Error { code: Internal, messages: ["Receiver went away before receiving EOF : Failed to write eof to fast store in fast_slow store update : --- : Writer was dropped before EOF was sent : Failed to receive data in filesystem store : While processing with temp file \"/root/.cache/turbo-cache/data-worker-test/tmp_path-cas/bd47cdb64c9fcf99f9b92f91ac9e407be88320381cc73a867d60040000000000-9453\" : Could not upload data to store in upload_file_to_store : Error while uploading results"] }
[2023-09-07T14:07:07.557Z WARN  simple_scheduler] Internal error for worker 46cd9913-c4bc-41f9-b1d3-cb3c3fe52d95: Error { code: Internal, messages: ["Receiver went away before receiving EOF : Failed to write eof to fast store in fast_slow store update : --- : Writer was dropped before EOF was sent : Failed to receive data in filesystem store : While processing with temp file \"/root/.cache/turbo-cache/data-worker-test/tmp_path-cas/bdf50ad4b2d7bf49796d02e6f68de97c8c051c134cbb95ab5461040000000000-9838\" : Could not upload data to store in upload_file_to_store : Error while uploading results"] }
[2023-09-07T14:07:09.229Z WARN  simple_scheduler] Internal error for worker 46cd9913-c4bc-41f9-b1d3-cb3c3fe52d95: Error { code: Internal, messages: ["Receiver went away before receiving EOF : Failed to write eof to fast store in fast_slow store update : --- : Writer was dropped before EOF was sent : Failed to receive data in filesystem store : While processing with temp file \"/root/.cache/turbo-cache/data-worker-test/tmp_path-cas/e7856deacd8906918a34ef8af8d16a120ba3d45fc678d8f2b262040000000000-9958\" : Could not upload data to store in upload_file_to_store : Error while uploading results"] }
[2023-09-07T14:07:14.642Z WARN  simple_scheduler] Internal error for worker 46cd9913-c4bc-41f9-b1d3-cb3c3fe52d95: Error { code: Internal, messages: ["Receiver went away before receiving EOF : Failed to write eof to fast store in fast_slow store update : --- : Writer was dropped before EOF was sent : Failed to receive data in filesystem store : While processing with temp file \"/root/.cache/turbo-cache/data-worker-test/tmp_path-cas/0b4fff3fa0cd03d3e65189ba9ee94986feb26539ba1fcc3cfd6a040000000000-9197\" : Could not upload data to store in upload_file_to_store : Error while uploading results"] }
[2023-09-07T14:07:17.001Z WARN  simple_scheduler] Internal error for worker 46cd9913-c4bc-41f9-b1d3-cb3c3fe52d95: Error { code: Internal, messages: ["Receiver went away before receiving EOF : Failed to write eof to fast store in fast_slow store update : --- : Writer was dropped before EOF was sent : Failed to receive data in filesystem store : While processing with temp file \"/root/.cache/turbo-cache/data-worker-test/tmp_path-cas/a350640a5de6855a692c7e81fd9b66dd1e2a27dffcccfd44ba70040000000000-12239\" : Could not upload data to store in upload_file_to_store : Error while uploading results"] }

I think we may have introduced a problem recently. I just updated the cluster to the latest main.

The only things in the CAS logs are:

[2023-09-07T14:02:52.560Z ERROR bytestream_server] Write Resp: 2.7540956 None Status { code: Internal, message: "Stream error at byte 8388608 : In ByteStreamServer::write()", source: None }
[2023-09-07T14:02:52.561Z ERROR bytestream_server] Write Resp: 8.560164 None Status { code: Internal, message: "Stream error at byte 25165824 : In ByteStreamServer::write()", source: None }
[2023-09-07T14:02:52.586Z ERROR bytestream_server] Write Resp: 9.943344 None Status { code: Internal, message: "Stream error at byte 29360128 : In ByteStreamServer::write()", source: None }
[2023-09-07T14:02:52.675Z ERROR bytestream_server] Write Resp: 0.034807287 None Status { code: Internal, message: "Stream error at byte 2097152 : In ByteStreamServer::write()", source: None }
[2023-09-07T14:02:52.692Z ERROR bytestream_server] Write Resp: 3.1737354 None Status { code: Internal, message: "Stream error at byte 12582912 : In ByteStreamServer::write()", source: None }
[2023-09-07T14:02:52.727Z ERROR bytestream_server] Write Resp: 0.07661839 None Status { code: Internal, message: "Stream error at byte 2097152 : In ByteStreamServer::write()", source: None }
[2023-09-07T14:02:52.734Z ERROR bytestream_server] Write Resp: 0.0853766 None Status { code: Internal, message: "Stream error at byte 2097152 : In ByteStreamServer::write()", source: None }
[2023-09-07T14:08:35.537Z ERROR bytestream_server] Write Resp: 0.012445173 None Status { code: Internal, message: "Stream error at byte 20480 : In ByteStreamServer::write()", source: None }
[2023-09-07T14:08:35.559Z ERROR bytestream_server] Write Resp: 0.010165742 None Status { code: Internal, message: "Stream error at byte 24576 : In ByteStreamServer::write()", source: None }
[2023-09-07T14:08:40.401Z ERROR bytestream_server] Write Resp: 0.012167023 None Status { code: Internal, message: "Stream error at byte 65536 : In ByteStreamServer::write()", source: None }

@allada
Member

allada commented Sep 7, 2023

It would be very useful to get some metrics for this published under Prometheus.

@chrisstaite-menlo
Collaborator Author

I'm getting this on the worker:

[2023-09-07T08:32:24.893Z ERROR filesystem_store] Error { code: NotFound, messages: ["No such file or directory (os error 2)", "Failed to touch file in filesystem store \"/root/.cache/turbo-cache/data-worker-test/tmp_path-cas/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934c945f030000000000-0\""] }
[2023-09-07T14:08:35.580Z ERROR filesystem_store] Error { code: NotFound, messages: ["No such file or directory (os error 2)", "Failed to touch file in filesystem store \"/root/.cache/turbo-cache/data-worker-test/tmp_path-cas/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934c7ca3040000000000-0\""] }
[2023-09-07T14:08:35.618Z ERROR filesystem_store] Error { code: NotFound, messages: ["No such file or directory (os error 2)", "Failed to touch file in filesystem store \"/root/.cache/turbo-cache/data-worker-test/tmp_path-cas/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934c7ca3040000000000-0\""] }
[2023-09-07T14:08:35.645Z ERROR filesystem_store] Error { code: NotFound, messages: ["No such file or directory (os error 2)", "Failed to touch file in filesystem store \"/root/.cache/turbo-cache/data-worker-test/tmp_path-cas/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934c7ca3040000000000-0\""] }
[2023-09-07T14:08:35.645Z ERROR filesystem_store] Error { code: NotFound, messages: ["No such file or directory (os error 2)", "Failed to touch file in filesystem store \"/root/.cache/turbo-cache/data-worker-test/tmp_path-cas/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934c7ca3040000000000-0\""] }
[2023-09-07T14:08:35.655Z ERROR filesystem_store] Error { code: NotFound, messages: ["No such file or directory (os error 2)", "Failed to touch file in filesystem store \"/root/.cache/turbo-cache/data-worker-test/tmp_path-cas/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934c7ca3040000000000-0\""] }
[2023-09-07T14:08:35.655Z ERROR filesystem_store] Error { code: NotFound, messages: ["No such file or directory (os error 2)", "Failed to touch file in filesystem store \"/root/.cache/turbo-cache/data-worker-test/tmp_path-cas/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934c7ca3040000000000-0\""] }
[2023-09-07T14:08:35.675Z ERROR filesystem_store] Error { code: NotFound, messages: ["No such file or directory (os error 2)", "Failed to touch file in filesystem store \"/root/.cache/turbo-cache/data-worker-test/tmp_path-cas/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934c7ca3040000000000-0\""] }
[2023-09-07T14:08:35.676Z ERROR filesystem_store] Error { code: NotFound, messages: ["No such file or directory (os error 2)", "Failed to touch file in filesystem store \"/root/.cache/turbo-cache/data-worker-test/tmp_path-cas/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934c7ca3040000000000-0\""] }
[2023-09-07T14:08:35.676Z ERROR filesystem_store] Error { code: NotFound, messages: ["No such file or directory (os error 2)", "Failed to touch file in filesystem store \"/root/.cache/turbo-cache/data-worker-test/tmp_path-cas/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934c7ca3040000000000-0\""] }
[2023-09-07T14:08:35.676Z ERROR filesystem_store] Error { code: NotFound, messages: ["No such file or directory (os error 2)", "Failed to touch file in filesystem store \"/root/.cache/turbo-cache/data-worker-test/tmp_path-cas/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934c7ca3040000000000-0\""] }

@allada
Member

allada commented Sep 7, 2023

Can you try reverting 67b90e2c9a254687b7525053bb4153f95e216b9d and see if that helps?

That commit was significant and is the only major change I can think of that might cause this.

@chrisstaite-menlo
Collaborator Author

Just doing a build with that reverted now.

@chrisstaite-menlo
Collaborator Author

Now running with that build and it appears a lot happier, although it's only been up for a few seconds, so I'll try and exercise it to check.

@chrisstaite-menlo
Collaborator Author

I've just hit it with a totally new branch and it's flying along without any issues. The CAS is running at a load average of 18 and rising, 300% CPU, and a 2.8 wa, but there are no stalls or breakages.

Looks like there might be an issue with 67b90e2.

@allada
Member

allada commented Sep 7, 2023

Dang, I will revert that commit. I may need to find another way to do what it's trying to do. I tried a few strategies, but this was the one I hated the least.

@chrisstaite-menlo
Collaborator Author

I spoke too soon; I'm getting stalls again. When it stalls, the scheduler and remoteexec_proxy CPU usage drops to a tiny amount and I get these:

[2023-09-07T14:54:23.990Z WARN  simple_scheduler] Internal error for worker 46ecd873-ab4f-4683-958f-c9b5b1db72b1: Error { code: Internal, messages: ["Receiver went away before receiving EOF : Failed to write eof to fast store in fast_slow store update : --- : Writer was dropped before EOF was sent : Failed to receive data in filesystem store : While processing with temp file /root/.cache/turbo-cache/data-worker-test/tmp_path-cas/fcbb86c63a3e9f318ef3e082d7cb809dcc9df658dd9b22a71477000000000000-10778 : Could not upload data to store in upload_to_store : Error while uploading results"] }
[2023-09-07T14:54:25.290Z WARN  simple_scheduler] Internal error for worker 46ecd873-ab4f-4683-958f-c9b5b1db72b1: Error { code: Internal, messages: ["Receiver went away before receiving EOF : Failed to write eof to fast store in fast_slow store update : --- : Writer was dropped before EOF was sent : Failed to receive data in filesystem store : While processing with temp file /root/.cache/turbo-cache/data-worker-test/tmp_path-cas/fe55b3977c64b4d0408e121f3e62406afb2258b73fbe6e192277000000000000-10888 : Could not upload data to store in upload_to_store : Error while uploading results"] }
[2023-09-07T14:54:27.344Z WARN  simple_scheduler] Internal error for worker dbfeef84-d1ad-4f6f-9f1e-9681195f2e35: Error { code: Internal, messages: ["Receiver went away before receiving EOF : Failed to write eof to fast store in fast_slow store update : --- : Writer was dropped before EOF was sent : Failed to receive data in filesystem store : While processing with temp file /root/.cache/turbo-cache/data-worker-test/tmp_path-cas/0d1a3dcffaedfe4a96e4ab48a6b97b627e407045c8393d63a378000000000000-8344 : Could not upload data to store in upload_to_store : for \"/root/.cache/turbo-cache/work/7403455e0ecaa6f782afa837ca22e09c11802d6424b144758d4dbec5de1d36f5/by/goma/obj/components/safety_check/safety_check/update_check_helper.o\" : Error while uploading results"] }
[2023-09-07T14:54:36.713Z WARN  simple_scheduler] Internal error for worker 46ecd873-ab4f-4683-958f-c9b5b1db72b1: Error { code: Internal, messages: ["Receiver went away before receiving EOF : Failed to write eof to fast store in fast_slow store update : --- : Writer was dropped before EOF was sent : Failed to receive data in filesystem store : While processing with temp file /root/.cache/turbo-cache/data-worker-test/tmp_path-cas/faef7fb3995348a490c099684ef8ff48c5a4d21a44cbaaf55079000000000000-8988 : Could not upload data to store in upload_to_store : for \"/root/.cache/turbo-cache/work/a9ef3a5608d364d407ed72a4542393a99aba5bdf327764eca13b7ba4a7f861ad/by/goma/obj/components/security_interstitials/core/core/bad_clock_ui.o.d\" : Error while uploading results"] }
[2023-09-07T14:54:37.671Z WARN  simple_scheduler] Internal error for worker 02ca6e82-c420-4286-b294-cb61181d3b9b: Error { code: Internal, messages: ["Receiver went away before receiving EOF : Failed to write eof to fast store in fast_slow store update : --- : Writer was dropped before EOF was sent : Failed to receive data in filesystem store : While processing with temp file /root/.cache/turbo-cache/data-worker-test/tmp_path-cas/6e6561700bbfbd48a75972966e5bf527a0f7cea14f3ff50cdf77000000000000-9735 : Could not upload data to store in upload_to_store : for \"/root/.cache/turbo-cache/work/ac584fe072761ea0a2a91e8f5a0b3a203e9db4880927581c3ce4cf405ef0d7ab/by/goma/obj/components/security_interstitials/core/core/safe_browsing_quiet_error_ui.o.d\" : Error while uploading results"] }
[2023-09-07T14:54:37.976Z WARN  simple_scheduler] Internal error for worker 02ca6e82-c420-4286-b294-cb61181d3b9b: Error { code: Internal, messages: ["Receiver went away before receiving EOF : Failed to write eof to fast store in fast_slow store update : --- : Writer was dropped before EOF was sent : Failed to receive data in filesystem store : While processing with temp file /root/.cache/turbo-cache/data-worker-test/tmp_path-cas/64f350e0acf2449906c1e7dea239a9fb5600e73b8a88f2e2fa77000000000000-8368 : Could not upload data to store in upload_to_store : for \"/root/.cache/turbo-cache/work/428835674a916f53f6e8b1fdde29879a719d8a6bb3216f5c954e9a57145861b6/by/goma/obj/components/security_interstitials/core/core/controller_client.o\" : Error while uploading results"] }
[2023-09-07T14:54:38.450Z WARN  simple_scheduler] Internal error for worker 02ca6e82-c420-4286-b294-cb61181d3b9b: Error { code: Internal, messages: ["Receiver went away before receiving EOF : Failed to write eof to fast store in fast_slow store update : --- : Writer was dropped before EOF was sent : Failed to receive data in filesystem store : While processing with temp file /root/.cache/turbo-cache/data-worker-test/tmp_path-cas/148722dcd9a71441c5783525482540187c087ae9d17246f21278000000000000-10085 : Could not upload data to store in upload_to_store : for \"/root/.cache/turbo-cache/work/6f35c0b9223d00b71422bec7c5a8a1a585f9bcd44a840064b25871c762071605/by/goma/obj/components/security_interstitials/core/core/https_only_mode_enforcelist.o.d\" : Error while uploading results"] }
[2023-09-07T14:54:38.509Z WARN  simple_scheduler] Internal error for worker dbfeef84-d1ad-4f6f-9f1e-9681195f2e35: Error { code: Internal, messages: ["Receiver went away before receiving EOF : Failed to write eof to fast store in fast_slow store update : --- : Writer was dropped before EOF was sent : Failed to receive data in filesystem store : While processing with temp file /root/.cache/turbo-cache/data-worker-test/tmp_path-cas/346bbdc66eeff2390a88ac5a1707ef7195645ef122d9c366bc7a000000000000-11609 : Could not upload data to store in upload_to_store : for \"/root/.cache/turbo-cache/work/a03735e7d1e744370ec03f9c39ad99fa6390ce6d607efca2d68a7fa31682cee7/by/goma/obj/components/segmentation_platform/internal/internal/field_trial_recorder.o.d\" : Error while uploading results"] }
[2023-09-07T14:54:39.107Z WARN  simple_scheduler] Internal error for worker 02ca6e82-c420-4286-b294-cb61181d3b9b: Error { code: Internal, messages: ["Receiver went away before receiving EOF : Failed to write eof to fast store in fast_slow store update : --- : Writer was dropped before EOF was sent : Failed to receive data in filesystem store : While processing with temp file /root/.cache/turbo-cache/data-worker-test/tmp_path-cas/9c060331fbf2cf65e497a425ff6f33bde624f64b8fbe41d84178000000000000-11684 : Could not upload data to store in upload_to_store : for \"/root/.cache/turbo-cache/work/5cff60d775f277c693c2ad6a7b4e1f39207dcd97ed67beb3a2adb273315382d9/by/goma/obj/components/segmentation_platform/internal/internal/metadata_utils.o.d\" : Error while uploading results"] }

@chrisstaite-menlo
Collaborator Author

I think before I got these issues I was running a really old version, back at around 259224b.

@chrisstaite-menlo
Collaborator Author

chrisstaite-menlo commented Sep 7, 2023

I don't suppose it could be 8ac4824? The only other option is b0b66a1 I think.

@allada
Member

allada commented Sep 7, 2023

Hmmm, would it be too much work to bisect it?

What hash were you running before attempting to upgrade?

@allada
Member

allada commented Sep 7, 2023

We might be conflating two issues.

The failed-to-upload stuff looks like it's failing a soft assert:

[WARN  simple_scheduler] Internal error for worker 46cd9913-c4bc-41f9-b1d3-cb3c3fe52d95: Error { code: Internal, messages: ["Receiver went away before receiving EOF : Failed to write eof to fast store in fast_slow store update : --- : Writer was dropped before EOF was sent : Failed to receive data in filesystem store : While processing with temp file \"/root/.cache/turbo-cache/data-worker-test/tmp_path-cas/06d463ed3e6600423456cdd4faf3c376f042ec06e38b9f002f2f040000000000-10824\" : Could not upload data to store in upload_file_to_store : Error while uploading results"] }

Triggered in:
https://github.com/allada/turbo-cache/blob/main/cas/scheduler/simple_scheduler.rs#L468

I can see a couple of ways this might happen, but none of them should be critical.

Does this actually fail the build or just put noise in the logs?

@chrisstaite-menlo
Collaborator Author

The build doesn't fail; it gets a RETRY entry in the Goma logs and stalls everything, like it locks up. I'm currently trying with just 8ac4824 reverted to see what I get.

@chrisstaite-menlo
Collaborator Author

Reverting that change does appear to have fixed it. And I think I have a reason why: previously, when 128 simultaneous builds produced their outputs, they only generated 128 concurrent uploads of results to the CAS. Now they upload stdout and stderr too, so there are 384 simultaneous uploads. Combine that with the downloads and you can see how the limit might be reached on the CAS. I think this is simply a scale thing.

@chrisstaite-menlo
Collaborator Author

Having said that, there are still four RETRY entries in Goma for this build, so perhaps it's still happening sometimes?

@chrisstaite-menlo
Collaborator Author

I think that these logs in the CAS indicate that an issue occurred:

[2023-09-07T15:57:36.444Z ERROR bytestream_server] Write Resp: 0.16644613 None Status { code: Internal, message: "Stream error at byte 24576 : In ByteStreamServer::write()", source: None }
[2023-09-07T15:58:09.957Z ERROR bytestream_server] Write Resp: 0.15555136 None Status { code: Internal, message: "Stream error at byte 36864 : In ByteStreamServer::write()", source: None }
[2023-09-07T15:58:26.965Z ERROR bytestream_server] Write Resp: 0.004299194 None Status { code: Internal, message: "Stream error at byte 32768 : In ByteStreamServer::write()", source: None }
[2023-09-07T16:02:00.397Z ERROR bytestream_server] Write Resp: 0.015137762 None Status { code: Internal, message: "Stream error at byte 20480 : In ByteStreamServer::write()", source: None }
[2023-09-07T16:02:07.979Z ERROR bytestream_server] Write Resp: 0.024254864 None Status { code: Internal, message: "Stream error at byte 45056 : In ByteStreamServer::write()", source: None }

Although they don't really explain what that issue was...

@allada
Member

allada commented Sep 7, 2023

Looking into this, it's super strange: the reader and writer are both complaining about each other:

--> Receiver went away before receiving EOF ...
--> Writer was dropped before EOF was sent ...

This is super confusing.

@chrisstaite-menlo
Collaborator Author

Should there just be a retry in there perhaps?

@allada
Member

allada commented Sep 7, 2023

I guess we could. I'd rather figure out the root cause though, because it might point to a more fundamental issue.

I'm still investigating what might be going on... The entire reason I created .err_tip() was because Buildbarn had the same issue of not giving many hints about where the problems were for this stuff, and .err_tip() was supposed to help... but I guess it's not enough :-(

@allada
Member

allada commented Sep 7, 2023

Can you share your worker config file?

@chrisstaite-menlo
Collaborator Author

{
  "stores": {
    "GRPC_LOCAL_STORE": {
      // Note: This file is used to test GRPC store.
      "grpc": {
        "instance_name": "main",
        "endpoints": ["grpc://${CAS_ENDPOINT:-127.0.0.1:50051}"],
        "store_type": "CAS"
      }
    },
    "GRPC_LOCAL_AC_STORE": {
      // Note: This file is used to test GRPC store.
      "grpc": {
        "instance_name": "main",
        "endpoints": ["grpc://${CAS_ENDPOINT:-127.0.0.1:50051}"],
        "store_type": "AC"
      }
    },
    "WORKER_FAST_SLOW_STORE": {
      "fast_slow": {
        "fast": {
          "filesystem": {
            "content_path": "/root/.cache/turbo-cache/data-worker-test/content_path-cas",
            "temp_path": "/root/.cache/turbo-cache/data-worker-test/tmp_path-cas",
            "eviction_policy": {
              // 1gb.
              "max_bytes": 1000000000,
            }
          }
        },
        "slow": {
          "ref_store": {
            "name": "GRPC_LOCAL_STORE",
          }
        }
      }
    }
  },
  "workers": [{
    "local": {
      "worker_api_endpoint": {
        "uri": "grpc://${SCHEDULER_ENDPOINT:-127.0.0.1:50061}",
      },
      "cas_fast_slow_store": "WORKER_FAST_SLOW_STORE",
      "ac_store": "GRPC_LOCAL_AC_STORE",
      "work_directory": "/root/.cache/turbo-cache/work",
      "platform_properties": {
        "OSFamily": {
          "values": ["Linux"],
        },
        "cpu_count": {
          "query_cmd": "nproc"
        },
      },
      "precondition_script": "/root/precondition_script.sh"
    }
  }],
  "servers": []
}

@allada
Member

allada commented Sep 7, 2023

Well, found one bug. I'm pretty sure it is not the root cause of anything, but it's still wrong (fast_rx is going to slow store and vice versa):
https://github.com/allada/turbo-cache/blob/main/cas/store/fast_slow_store.rs#L161

@chrisstaite-menlo
Collaborator Author

That would explain why the log says it's populating the fast store but it's remote...

@allada
Member

allada commented Sep 7, 2023

I'm pretty sure what is happening is that the GRPC store (or somewhere downstream) is getting the exact number of bytes needed, then closing the stream before the sender is able to send an EOF.

In other words, somewhere downstream says "oh, I got all the data you are going to send, I might as well close the connection", even though there's a zero-byte message that still needs to be sent (the EOF message).

If this is the case, I'd bet money that it's on the fence-post (like a 4096-byte boundary), and somewhere there's a short-circuit on "exact bytes received".
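
To illustrate the shape of the suspected bug (purely a sketch, not the actual turbo-cache receive path; the function name and the String error type are made up): a receiver that bails out the instant the byte count matches never consumes the writer's trailing zero-byte EOF message, so the writer sees its receiver drop.

use bytes::Bytes;
use futures::{Stream, StreamExt};

// Illustrative only: exiting as soon as `buf.len() == expected` drops the
// stream without reading the zero-byte EOF frame the sender still has to
// deliver, which the sender then reports as the receiver going away.
async fn receive_exact<S>(mut stream: S, expected: usize) -> Result<Vec<u8>, String>
where
    S: Stream<Item = Result<Bytes, String>> + Unpin,
{
    let mut buf = Vec::with_capacity(expected);
    while buf.len() < expected {
        match stream.next().await {
            Some(Ok(chunk)) => buf.extend_from_slice(&chunk),
            Some(Err(e)) => return Err(e),
            None => return Err("stream ended before expected size".to_string()),
        }
    }
    // Fence-post: when the last chunk lands exactly on `expected`, the EOF
    // message is never read before `stream` is dropped here.
    Ok(buf)
}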

@chrisstaite-menlo
Collaborator Author

chrisstaite-menlo commented Sep 7, 2023

Another fence post?

[edit] I should finish reading your message before commenting :)

The stream errors all show the current byte count when they fail:

24576 = 6 * 4096
36864 = 9 * 4096
32768 = 8 * 4096
20480 = 5 * 4096
45056 = 11 * 4096

@allada
Member

allada commented Sep 7, 2023

To test the theory, can you change DEFAULT_READ_BUFF_SIZE to some weird number like 1001 and retry? Let's see if this theory holds (lowering the number should, in theory, make it happen more frequently).

@chrisstaite-menlo
Collaborator Author

For the theory to work, the file I'm downloading or uploading needs to be an exact multiple of DEFAULT_READ_BUFF_SIZE, not just any smaller number, right?

@allada
Member

allada commented Sep 7, 2023

Yes. But my idea is that there are so many files being uploaded that one of them is bound to be hit.

@chrisstaite-menlo
Collaborator Author

Unfortunately, I'll have to look at it tomorrow now. I think we're certainly on to something though. It might also explain why the integration tests are sometimes flaky.

@allada
Member

allada commented Sep 7, 2023

Can you also share your CAS config?

@allada
Member

allada commented Sep 7, 2023

Oh, the red herring... This error message is super misleading...

All of the error messages after "Writer was dropped before EOF was sent" are caused by slow_tx getting dropped because fast_tx.send_eof() returned an error.

https://github.com/allada/turbo-cache/blob/f42f150926c23faba7aa63ba62a40eabb1ce8b20/cas/store/fast_slow_store.rs#L131

I'm very certain GrpcStore::update() is returning an Ok(()), but it's also the one not reading the EOF, so it's looking promising.

@chrisstaite
Contributor

Do you still need the CAS config?

@allada
Member

allada commented Sep 7, 2023

Let me see how far I can go, but it'd be nice to have.

@chrisstaite-menlo
Collaborator Author

{
  "stores": {
    "CAS_MAIN_STORE": {
      "fast_slow": {
        "fast": {
          "memory": {
            "eviction_policy": {
              // 4gb
              "max_bytes": 4000000000
            }
          }
        },
        "slow": {
          "filesystem": {
            "content_path": "/root/.cache/turbo-cache/content_path-cas",
            "temp_path": "/root/.cache/turbo-cache/tmp_path-cas",
            "eviction_policy": {
              // 150gb.
              "max_bytes": 150000000000,
              // 2gb
              "evict_bytes": 2000000000
            }
          }
        }
      }
    },
    "AC_MAIN_STORE": {
      "fast_slow": {
        "fast": {
          "memory": {
            "eviction_policy": {
              // 500mb
              "max_bytes": 500000000
            }
          }
        },
        "slow": {
           "filesystem": {
             "content_path": "/root/.cache/turbo-cache/content_path-cas_ac",
             "temp_path": "/root/.cache/turbo-cache/tmp_path-cas_ac",
             "eviction_policy": {
                // 10gb.
                "max_bytes": 10000000000,
             }
          }
        }
      }
    }
  },
  "servers": [{
    "listen_address": "0.0.0.0:50052",
    "services": {
      "cas": {
        "main": {
          "cas_store": "CAS_MAIN_STORE"
        },
      },
      "ac": {
        "main": {
          "ac_store": "AC_MAIN_STORE"
        }
      },
      "capabilities": {},
      "bytestream": {
        "cas_stores": {
          "main": "CAS_MAIN_STORE",
        },
        // According to https://github.com/grpc/grpc.github.io/issues/371 16KiB - 64KiB is optimal.
        "max_bytes_per_stream": 64000, // 64kb.
      }
    }
  }]
}

@allada
Member

allada commented Sep 7, 2023

I think I found the problem:
https://github.com/allada/turbo-cache/blob/f42f150926c23faba7aa63ba62a40eabb1ce8b20/cas/grpc_service/bytestream_server.rs#L398

It sees that the exact number of bytes has been received, publishes an EOF to the downstream store, then closes the channel without reading the possible EOF from the GRPC client's store.

I think to fix this we just need to try to read one last message here:
https://github.com/allada/turbo-cache/blob/f42f150926c23faba7aa63ba62a40eabb1ce8b20/cas/store/grpc_store.rs#L306

and if it returns an EOF or an error, ignore it. If it returns more data, that means it's an error.
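
Roughly the idea, as a hedged sketch (the function name, stream type, and String error type are illustrative, not the real grpc_store.rs internals): once the expected bytes have arrived, poll the client stream one more time and only treat additional data as a failure.

use bytes::Bytes;
use futures::{Stream, StreamExt};

// Sketch of the proposed fix: consume the client's trailing zero-byte EOF
// message instead of dropping the stream once the byte count is satisfied.
async fn drain_trailing_eof<S>(stream: &mut S) -> Result<(), String>
where
    S: Stream<Item = Result<Bytes, String>> + Unpin,
{
    match stream.next().await {
        None => Ok(()),                                // stream closed cleanly
        Some(Err(_)) => Ok(()),                        // client already went away; safe to ignore here
        Some(Ok(chunk)) if chunk.is_empty() => Ok(()), // explicit zero-byte EOF message
        Some(Ok(_)) => Err("client sent data past the expected EOF".to_string()),
    }
}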

@chrisstaite
Contributor

chrisstaite commented Sep 7, 2023

DropCloserWriteHalf.send_eof should check close_after_size before producing an error, right?

Or the drop for DropCloserWriteHalf should check before sending the error.

@chrisstaite-menlo
Collaborator Author

The only way that error can become set is if stream.next().await is Err, and calling it again is unlikely to make it anything other than Err; therefore your proposal essentially removes the error from GrpcStore.write.

@chrisstaite-menlo
Collaborator Author

The big problem is that inner_upload_file_to_store only returns the Err from the first Future to have one, so we don't actually get the error from the GrpcStore, only from the reader, which has had its other half dropped, so it always errors.

@chrisstaite-menlo
Collaborator Author

I tried adding:

            let data = if data.is_empty() {
                local_state.reader.recv().await.unwrap_or(data)
            } else {
                data
            };

to the unfold in GrpcStore.update. It didn't help; I'm still getting loads of:

"Receiver went away before receiving EOF : Failed to write eof to slow store in fast_slow store update : Could not upload data to store in upload_file_to_store : for \"/root/.cache/turbo-cache/work/7cadf5b8cacd40c29a4cf5f691fad8d0d3c7f7ffe324aae81ed0cd4aaaa7769e/by/goma/obj/gpu/command_buffer/service/gles2_sources/shared_image_backing_factory.o\" : Error while uploading results"] }

It's super weird. I've modified all of the paths to merge errors now and I don't get any errors from any other paths, so I don't understand how the Receiver is being dropped.

@chrisstaite-menlo
Collaborator Author

Can I suggest we also add this to avoid closing both ends of the pipe in error cases:

diff --git a/cas/store/ac_utils.rs b/cas/store/ac_utils.rs
index cb42067..bf0fb44 100644
--- a/cas/store/ac_utils.rs
+++ b/cas/store/ac_utils.rs
@@ -17,7 +17,7 @@ use std::io::Cursor;
 use std::pin::Pin;
 
 use bytes::BytesMut;
-use futures::{future::try_join, Future, FutureExt, TryFutureExt};
+use futures::{join, Future, FutureExt};
 use prost::Message;
 use sha2::{Digest, Sha256};
 use tokio::io::{AsyncRead, AsyncReadExt};
@@ -111,7 +111,13 @@ fn inner_upload_file_to_store<'a, Fut: Future<Output = Result<(), Error>> + 'a>(
     let upload_file_to_store_fut = cas_store
         .update(digest, rx, UploadSizeInfo::ExactSize(digest.size_bytes as usize))
         .map(|r| r.err_tip(|| "Could not upload data to store in upload_file_to_store"));
-    try_join(read_data_fn(tx), upload_file_to_store_fut).map_ok(|(_, _)| ())
+    let read_data_fut = read_data_fn(tx);
+    async move {
+        // Ensure we get errors reported from both sides
+        let (upload_result, read_result) = join!(upload_file_to_store_fut, read_data_fut);
+        upload_result.merge(read_result)?;
+        Ok(())
+    }
 }
 
 /// Uploads data to our store for given digest.

@chrisstaite-menlo
Collaborator Author

Confirming that I am not seeing this issue with PR #250 applied.

chrisstaite-menlo pushed a commit to chrisstaite-menlo/turbo-cache that referenced this issue Sep 8, 2023
@allada
Member

allada commented Sep 8, 2023

Added your suggestion (but modified).
