Stroller prototype #478
Conversation
Oh, re the choice of Rust: I'm a little worried about the maintenance burden of having to write all our APIs' JSON packets in Rust, vs OCaml, which we're all going to have to know anyway. I'd love to hear your thoughts on why you chose pulling from the DB (and then constructing the JSON packets in Rust) rather than just doing pure forwarding of packets from OCaml to Pusher.
Hmm, to make sure I'm understanding the concern correctly - you want to avoid the format of the JSON that gets pushed to the client being specified in Rust? Right now that format is literally just "whatever's in the `stored_events` row". That said, I could imagine us wanting to augment that (e.g. to send a whole trace rather than just an individual event as it does now). It'd certainly be feasible to have stroller just pass along whatever JSON it received in the POST, without parsing or reconstructing it.
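To make that concrete, here's a minimal sketch of the pass-through version - the `Push` trait is adapted from this PR (with `&self` added), and the handler function name is invented for illustration:

```rust
// Sketch only: `Push` is adapted from this PR's trait, `handle_event_post`
// is an invented name for the route handler.
trait Push {
    fn push(&self, json_bytes: &[u8]) -> Result<(), String>;
}

// Pure pass-through: stroller never parses or rebuilds the payload, so the
// JSON schema stays defined in OCaml (and the editor), not in Rust.
fn handle_event_post(pusher: &impl Push, body: &[u8]) -> Result<(), String> {
    pusher.push(body)
}
```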
What do you think? I didn't burn many cycles on the database parts and it wouldn't be hard to strip them out again.
Going to reread this a few times and have a think, but let me add one more complication: often a single trace is over 10kB. I see some 50kB traces, and we once had a 5MB trace.
One more thing: Pusher messages are limited to 10kB.
Yes, correct. I'm worried about the engineering overhead of adding new data to be sent to the client. It's already a bit expensive to specify JSON in different ways on the frontend and the backend (working on it!), and this seems like another such place.
Right, this is exactly what I'm thinking of.
Agree it might be cheaper. Do you have a sense of the cost? I'm guessing sub-1ms, and if it's more than that, that's not good.
I do see this argument too - perhaps we'll need to cut things up in a special format or something. My current thinking is to operate strictly as a pass-through proxy until performance tells us otherwise. But I'm very interested in your thoughts on the matter.
Any idea whether >10kB happens 5% of the time or 50%, and what's the main cause when it happens - HTTP events with giant unwieldy headers, deeply nested function call graphs, users sending ridiculous form arguments, etc.? If large traces are an edge case, then I'd be inclined to push ahead (pun intended) with Pusher for the moment, accepting that large traces wouldn't get pushed to the editor for the initial version. Two ways to mitigate:
If large traces are common enough that not pushing them would hurt the utility of this feature, maybe we could mitigate by splitting them up (e.g. sending one function argument/result pair at a time)?
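If we did go down that road, the chunking itself is simple - a rough sketch, where the greedy strategy, names, and framing are illustrative rather than anything in this PR (10kB is Pusher's documented message size limit):

```rust
// Greedily group pre-serialized events so each pushed message stays under
// Pusher's 10kB message size limit. An event larger than the limit would
// still get its own batch and need truncation or further splitting.
const PUSHER_MSG_LIMIT: usize = 10 * 1024;

fn batch_events(events: &[Vec<u8>]) -> Vec<Vec<&[u8]>> {
    let mut batches: Vec<Vec<&[u8]>> = vec![Vec::new()];
    let mut current_size = 0;
    for event in events {
        // Start a new batch rather than exceed the limit (unless the current
        // batch is empty, i.e. this single event is itself over the limit).
        if current_size + event.len() > PUSHER_MSG_LIMIT
            && !batches.last().unwrap().is_empty()
        {
            batches.push(Vec::new());
            current_size = 0;
        }
        batches.last_mut().unwrap().push(event.as_slice());
        current_size += event.len();
    }
    batches
}
```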
Yup, I'd assumed that stored_events would be high in volume. OCaml will always post to stroller, and stroller will always do something:
After some thought, I agree with your conclusion. Besides keeping the JSON format definition out of Rust, it also simplifies the architecture, allowing stroller to be a bit more of a black box. That said, some thoughts on "until performance tells us otherwise":
I don't have a good sense of how efficient OCaml's HTTP client library is at writing bytes, or of the distribution of trace sizes. Assuming Cohttp is decent and traces around 10kB, 1ms per POST is probably reasonable (loopback should be much faster than 10MB/s). For 1MB traces, on the other hand, I'd want to test it...

The biggest concern I'd have is that the single-threaded OCaml processes will now be making one of these POSTs for every end-user request for every Dark application - vs before, when it was just sending trace data once per editor poll. (Again, I'm not sure what end-user request volumes we're seeing right now, but presumably end-user requests should eventually become more frequent than the editor poll frequency.)

Do you have a sense of numbers for any of the above (trace size distributions, end-user request volumes per app)? My guess is this is more a scaling concern than a "right now" concern, and that even POSTing large traces to stroller will be an improvement relative to the current polling approach.
Much more like 5%, probably lower. Most of the values being sent down are relatively small, and the things I've observed are:
We clearly need to allow lazy or partial results to be sent to the client, but we don't have anything like that now. However, I'm not sure I'd call them an edge case, either. They do occur frequently - it's not so much 3% of users as 100% of users, 3% of the time. But when it fails, it fails hard and disables the core feature of the platform.
I very much support testing it!
Yes, I agree. Right now, we are polling one HTTP handler per second, and have the option to put timestamps in to check it.
If we assume a developer is coding 6hrs a day, then we poll 6 * 60 * 60 = 21,600 times per day per coder. So do 1-person coding teams receive more than 21k requests/day? Some will. Also, we will need to think about the number of events we actually save - we're not always going to be saving everything.
Traces are often 1kB and almost always <100kB. Here's a random selection. If you go to https://darklang.com/a/listo, don't click on anything, and go into the "Network" tab in Chrome dev tools, you'll see us fetching traces every second, each time for a random toplevel. Note that the current situation has lots of low-hanging fruit (not least of which is "fetch only newer than this timestamp").
If you're thinking "maybe this whole thing isn't the correct solution to this problem", my current thinking is that having a sidecar proxy solves a large technical problem in our codebase: how to make any non-blocking HTTP call (to Rollbar, or to send any other data) from within the OCaml process. It gives us a useful tool to let us optimize the analysis, and even if we end up doing something like moving analysis to another service, we're still going to need a proxy sidecar to send data to it. I expect it's an important step in enabling websockets in the product, as well as GraphQL.
Thanks for posting the numbers (and clarification that it's 100% of users 3% of the time vs vice versa). That definitely helps.
Yup, I definitely think the sidecar is the right model - sorry if I came across as suggesting otherwise. I was just wanting to flesh out the costs of sidecar-as-passthrough-proxy vs sidecar-as-semi-intelligent-background-worker (the current state of this PR). I think you're right that the passthrough proxy makes better sense - both for keeping schema definitions out of Rust and for migrating additional calls like Rollbar later on. Given the current request volume, I think performance of POSTing the JSON will be adequate - if it turns out to be a problem, maybe we can mitigate by doing the POST in the background.

I'll make the changes to pull out the DB calls and pass through the JSON, and push to this PR. (I'll also rebase onto master to fix the OCaml errors in the CI build.) Once I've got to grips with the backend codebase and added the call to stroller, I can do some smoke tests for large traces to see what the performance is like.
Re the 10kB limit: I see Pusher only as a stopgap to get something working, and it sounds like the message size limit will be the biggest reason we'll want to discard it in favour of websockets. Fortunately there's not much work specific to Pusher here - I think most of the work left to be done is 90% the same regardless of Pusher vs websockets:
I think it's still worth shipping the Pusher version as v1, under a flag, so we can test it out internally (knowing there's a trace size limitation). Then ditching Pusher and switching to real websockets for v2 will require a little additional work (which using Pusher allowed us to skip), but minimal rework:
The one piece of remaining work that is Pusher-specific is the wrapper for their REST API discussed above (to work around the lack of HTTPS support in the unofficial Rust library). I could skip that if we're okay with accepting plaintext traffic (e.g. for internal, non-sensitive apps only) for the stopgap period before v2.
The Rust is looking good so far! Couple of comments about next steps, but no changes requested.
```rust
// make actual Pusher call in the background without blocking the caller
// (since that's the whole point of having this in a separate process)
thread::spawn(move || {
```
This should probably use a threadpool in the n+1 version?
Yup - that would work around the Pusher client not being sharable between threads. This current approach is pretty inefficient, since we're making a new HTTP (and TCP) connection to Pusher every time we want to push something, although doing so in a background thread hides that latency from the backend.
That said, I'm pretty sure the n+1 version will be dropping Pusher anyway, and instead we'll just have the map of open client connections, with tokio handling concurrency.
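For the interim Pusher version, here's a hedged sketch of one way to reuse a single client without it needing to be shareable: one long-lived worker thread owns it and reads payloads off a channel. The `Push` trait mirrors this PR's (with `&self` added); everything else is illustrative.

```rust
// Hypothetical interim version: the worker thread is the sole owner of the
// Pusher client, and request handlers just send payload bytes down a channel.
use std::sync::mpsc;
use std::thread;

trait Push {
    fn push(&self, json_bytes: &[u8]) -> Result<(), String>;
}

fn spawn_push_worker<P: Push + Send + 'static>(pusher: P) -> mpsc::Sender<Vec<u8>> {
    let (tx, rx) = mpsc::channel::<Vec<u8>>();
    thread::spawn(move || {
        // Single owner: the client (and, once we reuse it, its HTTP
        // connection) lives for the life of this thread, not one per push.
        for json_bytes in rx {
            if let Err(e) = pusher.push(&json_bytes) {
                eprintln!("push failed: {}", e);
            }
        }
    });
    tx
}
```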
```rust
impl Push for push::Client {
    fn push(json_bytes: &[u8]) -> Result<(), String> {
        // TODO reuse push client!
```
Similarly, I'm imagining each thread having a thread-local Pusher connection?
Right, though see above. In the websocket world each thread will keep track of which websocket connection it's pushing to. (That will take some concurrency control, to avoid interleaving packets if two different POSTs come in for the same canvas - e.g. by having all the connections owned by a single "sender" thread onto which the request threads spawn futures.)
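A very rough sketch of that single-sender shape, using plain threads and a channel standing in for tokio futures (`WsConn` is a placeholder trait, not a real library type):

```rust
// Only the sender thread touches the per-canvas connections, so two POSTs
// for the same canvas can't interleave frames on the wire.
use std::collections::HashMap;
use std::sync::mpsc;
use std::thread;

type CanvasId = String;

trait WsConn: Send {
    fn send(&mut self, payload: &[u8]);
}

fn spawn_sender<C: WsConn + 'static>(
    mut connections: HashMap<CanvasId, C>,
) -> mpsc::Sender<(CanvasId, Vec<u8>)> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        for (canvas, payload) in rx {
            if let Some(conn) = connections.get_mut(&canvas) {
                conn.send(&payload);
            }
        }
    });
    tx
}
```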
Further design discussion via phone with @pbiggar and @IanConnolly - recording the outcome here for posterity.

We decided to reduce the scope slightly. Instead of pushing entire traces to the client, we'll just use push to notify the client when new traces are available, leaving retrieval of the full trace details to the existing "pull" path for now. This way OCaml only has to push lists of trace ids and tlids on every end-user request, instead of entire traces. This should still be enough to allow us to disable the 1-second poll, and it avoids some potential issues:
Since the last two points mitigate the main issues with Pusher (vs websockets), we'll plan to stick with Pusher after all for now, so I'll revive the work to replace the Pusher Rust library and provide TLS support. @pbiggar @IanConnolly

I believe implementing the above decisions mostly impacts my changes to the backend. It still needs changes to route messages to the right canvas, and to swap out the Pusher credentials for a Dark-owned Pusher account rather than my free test account, but I plan to address those in future PRs.
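For illustration, the reduced notification might look something like this - field names and types are guesses, not anything we decided:

```rust
// Hypothetical shape of the "new trace available" notification: just enough
// for the editor to know what to re-fetch via the existing pull path.
use serde::Serialize;

#[derive(Serialize)]
struct TraceNotification {
    trace_id: String, // id of the newly available trace
    tlids: Vec<u64>,  // toplevels the trace touches
}
```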
Yep! 🚢 @samstokes
This is a functioning sketch of how the stroller logic will look. The rough idea is that a `POST` request to stroller at `/canvas/:canvas_uuid/trace/:trace_uuid/events` causes stroller to fetch the latest event and push it to the editor.
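As a purely illustrative sketch (not the code in this PR) of extracting the ids from that path:

```rust
// Illustrative path matcher for /canvas/:canvas_uuid/trace/:trace_uuid/events.
fn parse_events_path(path: &str) -> Option<(&str, &str)> {
    let parts: Vec<&str> = path.trim_matches('/').split('/').collect();
    match parts.as_slice() {
        ["canvas", canvas_uuid, "trace", trace_uuid, "events"] => {
            Some((canvas_uuid, trace_uuid))
        }
        _ => None,
    }
}
```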
In the interests of getting to the interesting part, this PR takes a few steps at once, including handling the `POST` requests described above.

Still to do
Besides the obvious (actually making the backend call stroller, and the editor subscribe to Pusher), a few things are missing or could be improved. Given that nothing calls this yet, though, I propose they can wait until subsequent PRs.
`pusher-http-rust` doesn't support TLS (! - see below). I'm guessing this would be a deal breaker for production use, since it would mean sending end-user data in plaintext across the public Internet.

Pusher client
This gets its own section because it's a mess. I started off by using the `pusher` crate - it's under the `pusher-community` GitHub org but seemed at least semi-official. Unfortunately it turns out to be basically unmaintained - the last substantial commit is over 18 months old, and it depends on some old libraries. Among other issues, it depends on `hyper`, the same HTTP crate I'm using for the stroller web server, but an older version that relies on a pre-release version of `tokio`; that `hyper` is too old to be compatible with `hyper-tls`, which seems to be the more supported of the two `hyper` plugins adding TLS support (hence the unencrypted connection noted above).

I briefly looked into forking it to bring it up to date with current `hyper`, but the `hyper` API changes are significant and that didn't look like a good investment of time.

Instead we should probably just ditch this crate and write our own wrapper for the Pusher HTTP API, which is well documented and not very complicated. Unless we decide this is reason enough to ditch Pusher and deal with long-lived HTTP connections directly to stroller, I plan to work on that next.