
Stroller prototype #478

Merged
IanConnolly merged 20 commits into master from sam/stroller-prototype on Dec 14, 2018

Conversation

samstokes
Contributor

This is a functioning sketch of how the stroller logic will look. The rough idea is that:

  • when the server has stored a new trace
  • it will make a POST request to stroller at /canvas/:canvas_uuid/trace/:trace_uuid/events
  • that tells stroller to look up the stored events and function arguments/results for the specified canvas and trace, and send them (via Pusher) to the specified canvas

In the interests of getting to the interesting part, this PR takes a few steps at once:

  • adds Postgres support (via diesel) for accessing trace data
  • integrates Pusher (via pusher-http-rust - see below)
  • sets up the HTTP server to listen for POST requests (rough shape sketched below).
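For concreteness, here's roughly the shape of that server setup - a minimal sketch rather than the PR's actual code, assuming hyper 0.12's API and a hypothetical parse_events_path helper:

```rust
// Illustrative sketch only: accept POSTs to
// /canvas/:canvas_uuid/trace/:trace_uuid/events and 404 everything else.
use hyper::rt::Future;
use hyper::service::service_fn_ok;
use hyper::{Body, Method, Request, Response, Server, StatusCode};

// Pull (canvas_uuid, trace_uuid) out of a path like
// /canvas/aabbcc-123/trace/ddeeff-456/events, if it matches the route.
fn parse_events_path(path: &str) -> Option<(String, String)> {
    let parts: Vec<&str> = path.trim_matches('/').split('/').collect();
    if parts.len() == 5 && parts[0] == "canvas" && parts[2] == "trace" && parts[4] == "events" {
        Some((parts[1].to_string(), parts[3].to_string()))
    } else {
        None
    }
}

fn handle(req: Request<Body>) -> Response<Body> {
    match (req.method(), parse_events_path(req.uri().path())) {
        (&Method::POST, Some((canvas_uuid, trace_uuid))) => {
            // The real handler looks up the stored trace and pushes it;
            // here we just acknowledge the notification.
            Response::new(Body::from(format!(
                "pushing trace {} for canvas {}",
                trace_uuid, canvas_uuid
            )))
        }
        _ => {
            let mut not_found = Response::new(Body::empty());
            *not_found.status_mut() = StatusCode::NOT_FOUND;
            not_found
        }
    }
}

fn main() {
    let addr = ([127, 0, 0, 1], 3000).into();
    let server = Server::bind(&addr)
        .serve(|| service_fn_ok(handle))
        .map_err(|e| eprintln!("server error: {}", e));
    hyper::rt::run(server);
}
```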

Still to do

Besides the obvious (actually making the backend call stroller, and the editor subscribe to Pusher), a few things are missing or could be improved. Given that nothing calls this yet, though, I propose they can wait until subsequent PRs.

  • stroller's connection to the Pusher servers is unencrypted, because pusher-http-rust doesn't support TLS (! - see below). I'm guessing this would be a deal breaker for production use, since it would mean sending end-user data in plaintext across the public Internet.
  • this currently just pushes to a test Pusher channel - in production this will need to route the Pusher message to the appropriate canvas (e.g. by using the canvas UUID as the channel name)
    • do we currently support concurrent editing of the same canvas by multiple users (or the same user in multiple browser tabs)?
  • this currently uses Pusher credentials from a test account I signed up for - once this runs anywhere other than my laptop we'll presumably want prod and dev accounts with credentials in Dark's preferred credential sharing service.

Pusher client

This gets its own section because it's a mess. I started off by using the pusher crate - it's under the pusher-community GitHub org but seemed at least semi-official. Unfortunately it turns out to be basically unmaintained - the last substantial commit is over 18 months old, and it depends on some old libraries. Among other issues:

  • it depends on hyper, the same HTTP crate I'm using for the stroller web server, but an older version that relies on a pre-release version of tokio
  • that older version of hyper is too old to be compatible with hyper-tls, which seems to be the more supported of the two hyper plugins adding TLS support (hence the unencrypted connection noted above)

I briefly looked into forking it to bring it up to date with current hyper, but the hyper API changes are significant and that didn't look like a good investment of time.

Instead we should probably just ditch this crate and write our own wrapper for the Pusher HTTP API, which is well documented and not very complicated. Unless we decide this is reason enough to ditch Pusher and deal with long-lived HTTP connections directly to stroller, I plan to work on that next.
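The auth scheme is the only fiddly part: MD5 the body, build an alphabetically ordered query string, and HMAC-SHA256 the method/path/query with the app secret. A minimal sketch of what a handwritten wrapper's request signing could look like, per Pusher's documented REST auth scheme (crate choices - hmac, sha2, md5, hex - are my assumptions, and a real version still needs an HTTPS client on top):

```rust
use hmac::{Hmac, Mac};
use sha2::Sha256;

/// Build the signed URL for Pusher's POST /apps/{app_id}/events endpoint.
fn signed_events_url(app_id: &str, key: &str, secret: &str, body: &str, timestamp: u64) -> String {
    let body_md5 = format!("{:x}", md5::compute(body.as_bytes()));
    // Query params must be in alphabetical key order when signed.
    let query = format!(
        "auth_key={}&auth_timestamp={}&auth_version=1.0&body_md5={}",
        key, timestamp, body_md5
    );
    let to_sign = format!("POST\n/apps/{}/events\n{}", app_id, query);
    let mut mac = Hmac::<Sha256>::new_from_slice(secret.as_bytes())
        .expect("HMAC accepts keys of any length");
    mac.update(to_sign.as_bytes());
    let signature = hex::encode(mac.finalize().into_bytes());
    format!(
        "https://api.pusherapp.com/apps/{}/events?{}&auth_signature={}",
        app_id, query, signature
    )
}
```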

@samstokes
Contributor Author

Oh, re the choice of diesel over raw rust-postgres. This is mainly because I'm familiar with it from a side project, and partly out of personal preference for query-builder APIs over writing raw SQL. Definitely willing to revisit if you're not a fan (or if Dark in general prefers to avoid ORM-like libraries).

@pbiggar
Member

pbiggar commented Dec 7, 2018

I'm a little worried about the maintenance burden of having to write all our APIs' JSON packets in Rust, vs OCaml, which we're all going to have to know anyway.

I'd love to hear your thoughts on why you're pulling from the DB (and then constructing the JSON packets in Rust) rather than just doing pure forwarding of packets from OCaml to Pusher.

@samstokes
Contributor Author

> I'm a little worried about the maintenance burden of having to write all our APIs' JSON packets in Rust, vs OCaml, which we're all going to have to know anyway.
>
> I'd love to hear your thoughts on why you're pulling from the DB (and then constructing the JSON packets in Rust) rather than just doing pure forwarding of packets from OCaml to Pusher.

Hmm, to make sure I'm understanding the concern correctly - you want to avoid the format for the JSON that gets pushed to the client being specified in Rust? Right now that format is literally just "whatever's in the value column of the stored_events_v2 table", so you don't need to know any Rust to update that format. (i.e. right now "constructing the JSON packets in Rust" is literally just passing through the value field. The "message" field you can see in the code is part of the Pusher protocol, not part of the format the editor would consume.)

That said I could imagine us wanting to augment that (e.g. to send a whole trace rather than just an individual event as it does now, something like {"events": [...], "function_calls": [...]}; or to mask certain fields from the value JSON), and at that point there would be a format that would need specifying, and I see the concern there.

It'd certainly be feasible to have stroller just pass along whatever JSON it received in a POST. The main reasons I did it this way:

  • it seemed like the cheapest thing perf-wise - OCaml just has to make an empty POST, rather than writing out a JSON blob that it already wrote once to Postgres. (That said, Pusher caps out at 10k per message anyway unless we get special privileges, so maybe it's premature to care about large JSON blobs.)
  • I thought this might make it easier to get all the bits of a given trace at once - events, function arguments and function results - because stroller knows just enough about the DB layout to assemble a trace from its component parts. OCaml can just tell stroller "I just saved trace aabbcc-123" and stroller can grab all the bits. Could also totally see the argument for keeping stroller away from the DB and just being a passthrough.

What do you think? I didn't burn many cycles on the database parts and it wouldn't be hard to strip them out again.

@pbiggar
Member

pbiggar commented Dec 7, 2018

Going to reread this a few times and have a think, but let me add one more complication: often a single trace is over 10k. I see some 50k traces, and we once had a 5MB trace.

@pbiggar
Member

pbiggar commented Dec 7, 2018

One more thing:

  • the number of stored_events coming in is going to be much much larger than the number of people with their editor open. Will stroller be sending the data unconditionally (regardless of whether a user is connected)? If so, then this constraint doesn't matter.

@pbiggar
Member

pbiggar commented Dec 7, 2018

> Hmm, to make sure I'm understanding the concern correctly - you want to avoid the format for the JSON that gets pushed to the client being specified in Rust?

Yes, correct. I'm worried about the engineering overhead of adding new data to be sent to the client. It's already a bit expensive to specify JSON in different ways on the frontend and the backend (working on it!), and this seems like another such place.

> That said I could imagine us wanting to augment that (e.g. to send a whole trace rather than just an individual event as it does now, something like {"events": [...], "function_calls": [...]}; or to mask certain fields from the value JSON), and at that point there would be a format that would need specifying, and I see the concern there.

Right, this is exactly what I'm thinking of.

> it seemed like the cheapest thing perf-wise - OCaml just has to make an empty POST, rather than writing out a JSON blob that it already wrote once to Postgres. (That said, Pusher caps out at 10k per message anyway unless we get special privileges, so maybe it's premature to care about large JSON blobs.)

Agree it might be cheaper. Do you have a sense of the cost? I'm guessing sub-1ms, and if it's more than that, that's not good.

> I thought this might make it easier to get all the bits of a given trace at once - events, function arguments and function results - because stroller knows just enough about the DB layout to assemble a trace from its component parts. OCaml can just tell stroller "I just saved trace aabbcc-123" and stroller can grab all the bits. Could also totally see the argument for keeping stroller away from the DB and just being a passthrough.

I do see this argument too - perhaps we'll need to cut up things in a special format or something.

My current thinking is to perhaps operate strictly as a pass-through proxy until performance tells us otherwise. But very interested in your thoughts on the matter.

@samstokes
Contributor Author

> Going to reread this a few times and have a think, but let me add one more complication: often a single trace is over 10k. I see some 50k traces, and we once had a 5MB trace.

Any idea whether >10k happens 5% of the time or 50%, and what's the main cause when it happens - is it HTTP events with giant unwieldy headers, deeply nested function call graphs, users sending ridiculous form arguments, etc?

If large traces are an edge case, then I'd be inclined to push ahead (pun intended) with Pusher for the moment, accepting that large traces wouldn't get pushed to the editor for the initial version. Two ways to mitigate:

  • just move forward with getting rid of Pusher in favour of real websockets ASAP after validating this approach
  • have a less-frequent "catchup" poll (e.g. with a "here are the traces I already have" mechanism) - we might want this anyway, since neither Pusher nor websockets would 100% guarantee delivery of every trace (see musings here)

If large traces are common enough that not pushing them would hurt the utility of this feature, maybe we could mitigate by splitting them up (e.g. sending one function argument/result pair at a time)?

@samstokes
Contributor Author

> the number of stored_events coming in is going to be much much larger than the number of people with their editor open. Will stroller be sending the data unconditionally (regardless of whether a user is connected)? If so, then this constraint doesn't matter.

Yup, I'd assumed that stored_events would be high in volume. OCaml will always post to stroller and stroller will always do something:

  • in the Pusher case, I was assuming stroller would always push traces to Pusher, leaving it up to Pusher to decide whether to actually deliver the traces to anyone. That does mean we'd be consuming our Pusher events/day quota even when nobody was editing. I don't know what sort of end-user request volume we're seeing, but Pusher's paid plans start at 1 million events/day which I'm guessing is good enough for validation?
    • if that's a problem, it won't be hard to query Pusher first to check if anyone is subscribed, and skip pushing the trace if not (although that introduces a race condition which would increase the best-effort-ness of the delivery)
  • once we go to websockets, stroller will have a map of open connections, so it'll simply not push anything if nobody's listening.

@samstokes
Contributor Author

> My current thinking is to perhaps operate strictly as a pass-through proxy until performance tells us otherwise. But very interested in your thoughts on the matter.

After some thought, I agree with your conclusion. Besides keeping the JSON format definition out of Rust, it also simplifies the architecture, allowing stroller to be a bit more of a black box.

That said, some thoughts on "until performance tells us otherwise":

> Agree it might be cheaper. Do you have a sense of the cost? I'm guessing sub-1ms, and if it's more than that, that's not good.

I don't have a good sense of how efficient OCaml's HTTP client library is at writing bytes, or of the distribution of trace sizes. Assuming Cohttp is decent and ~10kB traces, 1ms per POST is probably reasonable (10kB in 1ms is only 10MB/s, and loopback should be much faster than that). For 1MB traces on the other hand I'd want to test it...

The biggest concern I'd have is that the single-threaded OCaml processes will now be making one of these POSTs for every end-user request for every Dark application - vs before when it was just sending trace data once per editor poll. (Again I'm not sure what end-user request volumes we're seeing right now, but presumably end-user requests should eventually become more frequent than the editor poll frequency.)

Do you have a sense for numbers for any of the above (trace size distributions, end-user request volumes per app)? My guess is this is more a scaling concern rather than a "right now" concern, and that even POSTing large traces to stroller will be an improvement relative to the current polling approach.

@pbiggar
Member

pbiggar commented Dec 9, 2018

> Any idea whether >10k happens 5% of the time or 50%, and what's the main cause when it happens - is it HTTP events with giant unwieldy headers, deeply nested function call graphs, users sending ridiculous form arguments, etc?

Much more like 5%, probably lower. Most of the values being sent down are relatively small, and the things I've observed are:

  • user doing DB::fetchAll on a DB with 5MB in it (tons of small objects in a list)
  • user doing HTTPClient::get on some large page (a single string value)
  • user iterating over a loop (we save the results of every impure function in the loop (aka "function_results"), as well as the arguments to user functions).

We clearly need to allow lazy or partial results to be sent to the client, but we don't have anything like that now.

However, I'm not sure I'd call them an edge case. They do occur frequently; it's not so much 3% of users as 100% of users, 3% of the time. And when it fails, it fails hard and disables the core feature of the platform.

> For 1MB traces on the other hand I'd want to test it...

I very much support testing it!

> for every end-user request for every Dark application - vs before when it was just sending trace data once per editor poll.

Yes, I agree. Right now, we are polling one HTTP handler per second, and have the option to put timestamps in to check it.

> (Again I'm not sure what end-user request volumes we're seeing right now, but presumably end-user requests should eventually become more frequent than the editor poll frequency.)

If we assume a developer is coding 6hrs a day, then we poll 6 * 60 * 60 times per day per coder. So do 1-person coding teams receive more than 21k req/day? Some will. Also, we will need to think about the number of events we actually save - we're not going to always be saving everything.

> Do you have a sense for numbers for any of the above (trace size distributions, end-user request volumes per app)? My guess is this is more a scaling concern rather than a "right now" concern, and that even POSTing large traces to stroller will be an improvement relative to the current polling approach.

Traces are often 1k and almost always <100k.

Here's a random selection. If you go to https://darklang.com/a/listo, don't click on anything, and go into the "Network" tab in Chrome dev tools, you'll see us fetching traces every second, each time for a random toplevel.

[screenshot: Network tab showing the per-second trace fetches and their response sizes]

Note that the current situation has lots of low-hanging fruit (not least of which is "fetch only newer than this timestamp").

@pbiggar
Member

pbiggar commented Dec 9, 2018

If you're thinking "maybe this whole thing isn't the correct solution to this problem", my current thinking is that having a sidecar proxy solves a large technical problem in our codebase (how to send any non-blocking HTTP call, such as Rollbar or any other data, from within the OCaml process). It creates a useful tool to allow us to optimize the analysis, and even if we end up doing something like moving analysis to another service, we're still going to need a proxy sidecar to send data to it.

I expect it's an important step in enabling websockets in the product, as well as GraphQL.

@samstokes
Contributor Author

Thanks for posting the numbers (and clarification that it's 100% of users 3% of the time vs vice versa). That definitely helps.

> If you're thinking "maybe this whole thing isn't the correct solution to this problem", my current thinking is that having a sidecar proxy solves a large technical problem in our codebase (how to send any non-blocking HTTP call, such as Rollbar or any other data, from within the OCaml process). It creates a useful tool to allow us to optimize the analysis, and even if we end up doing something like moving analysis to another service, we're still going to need a proxy sidecar to send data to it.
>
> I expect it's an important step in enabling websockets in the product, as well as GraphQL.

Yup, I definitely think the sidecar is the right model, sorry if I came across as suggesting otherwise. I was just wanting to flesh out the costs of sidecar-as-passthrough-proxy vs sidecar-as-semi-intelligent-background-worker (as the current state of this PR).

I think you're right that passthrough proxy makes better sense - both for keeping schema definitions out of Rust and for migrating additional calls like Rollbar later on. Given the current request volume I think performance of POSTing the JSON will be adequate - if it turns out to be a problem maybe we can mitigate by doing the POST in the background, e.g. with Lwt.async? (N.B. I don't know Lwt's concurrency model well, assuming it's similar to other async I/O / green-thread models.)

I'll make the changes to pull out the DB calls and pass through the JSON, and push to this PR. (I'll also rebase onto master to fix the OCaml errors in the CI build.) Once I've got to grips with the backend codebase and added the call to stroller, I can do some smoke tests for large traces to see what the performance is like.

@samstokes
Contributor Author

Re the 10kB limit: I see Pusher only as a stopgap to get something working, and it sounds like the message size limit will be the biggest reason we'll want to discard it in favour of websockets. Fortunately there's not much work specific to Pusher here - most of the remaining work is ~90% the same regardless of Pusher vs websockets:

  • wire up the client to subscribe to the channel and receive traces via push
  • make the backend POST to stroller
  • DevOps Stuff™ to run stroller as a sidecar

I think it's still worth shipping the Pusher version as v1, under a flag, so we can test it out internally (knowing there's a trace size limitation).

Then ditching Pusher and switching to real websockets for v2 will require a little additional work (which using Pusher allowed us to skip), but minimal rework:

  • change client to make a websocket connection rather than a Pusher one
  • change stroller to accept websocket
  • add connection management and lookup to stroller
  • more DevOps Stuff™ to have nginx route websocket connections to stroller (and make sure persistent connections work)

The one piece of remaining work that is Pusher-specific is the wrapper for their REST API discussed above (to work around the lack of HTTPS support in the unofficial Rust library). I could skip that if we're okay with accepting plaintext traffic (e.g. for internal, non-sensitive apps only) for the stopgap period before v2.

@IanConnolly (Contributor) left a comment

The Rust is looking good so far! Couple of comments about next steps, but no changes requested.


```rust
// make actual Pusher call in the background without blocking the caller
// (since that's the whole point of having this in a separate process)
thread::spawn(move || {
```
@IanConnolly (Contributor)

this should probably use a threadpool in the n+1 version?

@samstokes (Contributor, Author)

Yup - that would work around the Pusher client not being sharable between threads. This current approach is pretty inefficient, since we're making a new HTTP (and TCP) connection to Pusher every time we want to push something, although doing so in a background thread hides that latency from the backend.

That said, I'm pretty sure the n+1 version will be dropping Pusher anyway, and instead we'll just have the map of open client connections, with tokio handling concurrency.


```rust
impl Push for push::Client {
    fn push(json_bytes: &[u8]) -> Result<(), String> {
        // TODO reuse push client!
```
@IanConnolly (Contributor)

similarly, i'm imagining each thread having a thread-local pusher connection?

@samstokes (Contributor, Author)

Right, though see above. In the websocket world each thread will keep track of which websocket connection it's pushing to. (That will take some concurrency control, to avoid interleaving packets if two different POSTs come in for the same canvas - e.g. by having all the connections owned by a single "sender" thread onto which the request threads spawn futures.)
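The ownership model I have in mind looks something like this - a minimal std-only sketch with hypothetical names (the real version would hold websocket connections and use tokio futures rather than a thread and byte buffers):

```rust
use std::collections::HashMap;
use std::sync::mpsc;
use std::thread;

type CanvasId = String;

fn main() {
    // Request handlers send (canvas, payload) pairs; exactly one thread owns
    // all the connections, so writes to a connection can never interleave.
    let (tx, rx) = mpsc::channel::<(CanvasId, Vec<u8>)>();

    let sender = thread::spawn(move || {
        // Stand-in for the map of open websocket connections.
        let mut connections: HashMap<CanvasId, Vec<u8>> = HashMap::new();
        for (canvas, payload) in rx {
            connections.entry(canvas).or_default().extend(payload);
        }
    });

    // A request thread handing off a payload for one canvas:
    tx.send(("aabbcc-123".into(), br#"{"traces":[]}"#.to_vec())).unwrap();

    drop(tx); // close the channel so the sender thread can exit
    sender.join().unwrap();
}
```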

@samstokes
Contributor Author

Further design discussion via phone with @pbiggar and @IanConnolly - recording outcome here for posterity:

We decided to reduce the scope slightly. Instead of pushing entire traces to the client, we'll just use push to notify the client when new traces are available, leaving retrieval of the full trace details to the existing "pull" path for now. This way OCaml only has to push lists of trace ids and tlids on every end-user request, instead of entire traces (a possible payload shape is sketched after the list below). This should still be enough to allow us to disable the 1-second poll, and avoids some potential issues:

  • reduces the amount of analysis work OCaml has to do
  • reduces the size of payload OCaml has to serialize (skirting concerns about I/O latency and/or Lwt-intensive optimisation strategies)
  • reduces the size of payload we're sending through Pusher, mitigating the 10kB message limit
  • means we're not pushing any sensitive data (trace ids and tlids are not sensitive), so we can make do with the lack of TLS support for now
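To make the reduced payload concrete, here's a hypothetical shape for the notification (field names are illustrative, not a final schema):

```rust
use serde::Serialize;

// Notification that a new trace exists: just ids, no trace contents.
// Serializes to e.g. {"canvas_uuid":"...","trace_id":"...","tlids":[101,102]}
#[derive(Serialize)]
struct NewTraceNotification {
    canvas_uuid: String,
    trace_id: String,
    tlids: Vec<u64>, // toplevels whose analysis changed for this trace
}
```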

Since the last two points mitigate the main issues with Pusher (vs websockets), we'll plan to stick with Pusher after all for now, so I'll revive the work to replace the Pusher Rust library and provide TLS support.

@pbiggar @IanConnolly I believe implementing the above decisions mostly impacts my changes to client and backend (in flight) and doesn't change the stroller code much, so LMK if this is okay to merge as is (again, with nothing using it yet).

It still needs changes to route messages to the right canvas, and to swap out the Pusher credentials for a Dark-owned Pusher account rather than my free test account, but I plan to address those in future PRs.

@IanConnolly (Contributor) left a comment

Yep! 🚢 @samstokes

@IanConnolly merged commit c96e5ca into master on Dec 14, 2018
@pbiggar deleted the sam/stroller-prototype branch on December 14, 2018