Stroller prototype #478
Conversation
Oh, re the choice of Rust: I'm a little worried about the maintenance burden of having to write all our APIs' JSON packets in Rust, vs OCaml, which we're all going to have to know anyway. I'd love to hear your thoughts on why you chose pulling from the DB (and then constructing the JSON packets in Rust) rather than just doing pure forwarding of packets from OCaml to Pusher.
Hmm, to make sure I'm understanding the concern correctly - you want to avoid the format of the JSON that gets pushed to the client being specified in Rust? Right now that format is literally just "whatever's in the `stored_events` row". That said, I could imagine us wanting to augment that (e.g. to send a whole trace rather than just an individual event as it does now). It'd certainly be feasible to have stroller just pass along whatever JSON it received in the POST, without parsing or reconstructing it.
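To make that concrete, here's a minimal sketch of the pass-through version - the `Push` trait is adapted from this PR (with `&self` added), and the handler function name is invented for illustration:

```rust
// Sketch only: `Push` is adapted from this PR's trait, `handle_event_post`
// is an invented name for the route handler.
trait Push {
    fn push(&self, json_bytes: &[u8]) -> Result<(), String>;
}

// Pure pass-through: stroller never parses or rebuilds the payload, so the
// JSON schema stays defined in OCaml (and the editor), not in Rust.
fn handle_event_post(pusher: &impl Push, body: &[u8]) -> Result<(), String> {
    pusher.push(body)
}
```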
What do you think? I didn't burn many cycles on the database parts and it wouldn't be hard to strip them out again.
Going to reread this a few times and have a think, but let me add one more complication: often a single trace is over 10kB. I see some 50kB traces, and we once had a 5MB trace.
One more thing: Pusher messages are limited to 10kB.
Yes, correct. I'm worried about the engineering overhead of adding new data to be sent to the client. It's already a bit expensive to specify JSON in different ways on the frontend and the backend (working on it!), and this seems like another such place.
Right, this is exactly what I'm thinking of.
Agree it might be cheaper. Do you have a sense of the cost? I'm guessing sub-1ms, and if it's more than that, that's not good.
I do see this argument too - perhaps we'll need to cut things up in a special format or something. My current thinking is to operate strictly as a pass-through proxy until performance tells us otherwise. But I'm very interested in your thoughts on the matter.
Any idea whether >10kB happens 5% of the time or 50%, and what's the main cause when it happens - HTTP events with giant unwieldy headers, deeply nested function call graphs, users sending ridiculous form arguments, etc.? If large traces are an edge case, then I'd be inclined to push ahead (pun intended) with Pusher for the moment, accepting that large traces wouldn't get pushed to the editor for the initial version. Two ways to mitigate:
If large traces are common enough that not pushing them would hurt the utility of this feature, maybe we could mitigate by splitting them up (e.g. sending one function argument/result pair at a time)?
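If we did go down that road, the chunking itself is simple - a rough sketch, where the greedy strategy, names, and framing are illustrative rather than anything in this PR (10kB is Pusher's documented message size limit):

```rust
// Greedily group pre-serialized events so each pushed message stays under
// Pusher's 10kB message size limit. An event larger than the limit would
// still get its own batch and need truncation or further splitting.
const PUSHER_MSG_LIMIT: usize = 10 * 1024;

fn batch_events(events: &[Vec<u8>]) -> Vec<Vec<&[u8]>> {
    let mut batches: Vec<Vec<&[u8]>> = vec![Vec::new()];
    let mut current_size = 0;
    for event in events {
        // Start a new batch rather than exceed the limit (unless the current
        // batch is empty, i.e. this single event is itself over the limit).
        if current_size + event.len() > PUSHER_MSG_LIMIT
            && !batches.last().unwrap().is_empty()
        {
            batches.push(Vec::new());
            current_size = 0;
        }
        batches.last_mut().unwrap().push(event.as_slice());
        current_size += event.len();
    }
    batches
}
```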
Yup, I'd assumed that stored_events would be high in volume. OCaml will always post to stroller, and stroller will always do something:
After some thought, I agree with your conclusion. Besides keeping the JSON format definition out of Rust, it also simplifies the architecture, allowing stroller to be a bit more of a black box. That said, some thoughts on "until performance tells us otherwise":
I don't have a good sense of how efficient OCaml's HTTP client library is at writing bytes, or of the distribution of trace sizes. Assuming Cohttp is decent and traces around 10kB, 1ms per POST is probably reasonable (loopback should be much faster than 10MB/s). For 1MB traces, on the other hand, I'd want to test it...

The biggest concern I'd have is that the single-threaded OCaml processes will now be making one of these POSTs for every end-user request for every Dark application - vs before, when it was just sending trace data once per editor poll. (Again, I'm not sure what end-user request volumes we're seeing right now, but presumably end-user requests should eventually become more frequent than the editor poll frequency.)

Do you have a sense of numbers for any of the above (trace size distributions, end-user request volumes per app)? My guess is this is more a scaling concern than a "right now" concern, and that even POSTing large traces to stroller will be an improvement relative to the current polling approach.
Much more like 5%, probably lower. Most of the values being sent down are relatively small, and the things I've observed are:
We clearly need to allow lazy or partial results to be sent to the client, but we don't have anything like that now. However, I'm not sure I'd call them an edge case, either. They do occur frequently - it's not so much 3% of users as 100% of users, 3% of the time. But when it fails, it fails hard and disables the core feature of the platform.
I very much support testing it!
Yes, I agree. Right now, we are polling one HTTP handler per second, and have the option to put timestamps in to check it.
If we assume a developer is coding 6hrs a day, then we poll 6 * 60 * 60 = 21,600 times per day per coder. So do 1-person coding teams receive more than 21k requests/day? Some will. Also, we will need to think about the number of events we actually save - we're not always going to be saving everything.
Traces are often 1kB and almost always <100kB. Here's a random selection. If you go to https://darklang.com/a/listo, don't click on anything, and go into the "Network" tab in Chrome dev tools, you'll see us fetching traces every second, each time for a random toplevel. Note that the current situation has lots of low-hanging fruit (not least of which is "fetch only newer than this timestamp").
If you're thinking "maybe this whole thing isn't the correct solution to this problem", my current thinking is that having a sidecar proxy solves a large technical problem in our codebase: how to make any non-blocking HTTP call (to Rollbar, or to send any other data) from within the OCaml process. It gives us a useful tool to let us optimize the analysis, and even if we end up doing something like moving analysis to another service, we're still going to need a proxy sidecar to send data to it. I expect it's an important step in enabling websockets in the product, as well as GraphQL.
Thanks for posting the numbers (and clarification that it's 100% of users 3% of the time vs vice versa). That definitely helps.
Yup, I definitely think the sidecar is the right model - sorry if I came across as suggesting otherwise. I was just wanting to flesh out the costs of sidecar-as-passthrough-proxy vs sidecar-as-semi-intelligent-background-worker (the current state of this PR). I think you're right that the passthrough proxy makes better sense - both for keeping schema definitions out of Rust and for migrating additional calls like Rollbar later on. Given the current request volume, I think performance of POSTing the JSON will be adequate - if it turns out to be a problem, maybe we can mitigate by doing the POST in the background.

I'll make the changes to pull out the DB calls and pass through the JSON, and push to this PR. (I'll also rebase onto master to fix the OCaml errors in the CI build.) Once I've got to grips with the backend codebase and added the call to stroller, I can do some smoke tests for large traces to see what the performance is like.
Re the 10kB limit: I see Pusher only as a stopgap to get something working, and it sounds like the message size limit will be the biggest reason we'll want to discard it in favour of websockets. Fortunately there's not much work specific to Pusher here - I think most of the work left to be done is 90% the same regardless of Pusher vs websockets:
I think it's still worth shipping the Pusher version as v1, under a flag, so we can test it out internally (knowing there's a trace size limitation). Then ditching Pusher and switching to real websockets for v2 will require a little additional work (which using Pusher allowed us to skip), but minimal rework:
The one piece of remaining work that is Pusher-specific is the wrapper for their REST API discussed above (to work around the lack of HTTPS support in the unofficial Rust library). I could skip that if we're okay with accepting plaintext traffic (e.g. for internal, non-sensitive apps only) for the stopgap period before v2.
The Rust is looking good so far! Couple of comments about next steps, but no changes requested.
```rust
// make actual Pusher call in the background without blocking the caller
// (since that's the whole point of having this in a separate process)
thread::spawn(move || {
```
This should probably use a threadpool in the n+1 version?
Yup - that would work around the Pusher client not being sharable between threads. This current approach is pretty inefficient, since we're making a new HTTP (and TCP) connection to Pusher every time we want to push something, although doing so in a background thread hides that latency from the backend.
That said, I'm pretty sure the n+1 version will be dropping Pusher anyway, and instead we'll just have the map of open client connections, with tokio handling concurrency.
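For the interim Pusher version, here's a hedged sketch of one way to reuse a single client without it needing to be shareable: one long-lived worker thread owns it and reads payloads off a channel. The `Push` trait mirrors this PR's (with `&self` added); everything else is illustrative.

```rust
// Hypothetical interim version: the worker thread is the sole owner of the
// Pusher client, and request handlers just send payload bytes down a channel.
use std::sync::mpsc;
use std::thread;

trait Push {
    fn push(&self, json_bytes: &[u8]) -> Result<(), String>;
}

fn spawn_push_worker<P: Push + Send + 'static>(pusher: P) -> mpsc::Sender<Vec<u8>> {
    let (tx, rx) = mpsc::channel::<Vec<u8>>();
    thread::spawn(move || {
        // Single owner: the client (and, once we reuse it, its HTTP
        // connection) lives for the life of this thread, not one per push.
        for json_bytes in rx {
            if let Err(e) = pusher.push(&json_bytes) {
                eprintln!("push failed: {}", e);
            }
        }
    });
    tx
}
```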
```rust
impl Push for push::Client {
    fn push(json_bytes: &[u8]) -> Result<(), String> {
        // TODO reuse push client!
```
Similarly, I'm imagining each thread having a thread-local Pusher connection?
Right, though see above. In the websocket world each thread will keep track of which websocket connection it's pushing to. (That will take some concurrency control, to avoid interleaving packets if two different POSTs come in for the same canvas - e.g. by having all the connections owned by a single "sender" thread onto which the request threads spawn futures.)
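A very rough sketch of that single-sender shape, using plain threads and a channel standing in for tokio futures (`WsConn` is a placeholder trait, not a real library type):

```rust
// Only the sender thread touches the per-canvas connections, so two POSTs
// for the same canvas can't interleave frames on the wire.
use std::collections::HashMap;
use std::sync::mpsc;
use std::thread;

type CanvasId = String;

trait WsConn: Send {
    fn send(&mut self, payload: &[u8]);
}

fn spawn_sender<C: WsConn + 'static>(
    mut connections: HashMap<CanvasId, C>,
) -> mpsc::Sender<(CanvasId, Vec<u8>)> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        for (canvas, payload) in rx {
            if let Some(conn) = connections.get_mut(&canvas) {
                conn.send(&payload);
            }
        }
    });
    tx
}
```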
Further design discussion via phone with @pbiggar and @IanConnolly - recording the outcome here for posterity.

We decided to reduce the scope slightly. Instead of pushing entire traces to the client, we'll just use push to notify the client when new traces are available, leaving retrieval of the full trace details to the existing "pull" path for now. This way OCaml only has to push lists of trace ids and tlids on every end-user request, instead of entire traces. This should still be enough to allow us to disable the 1-second poll, and it avoids some potential issues:
Since the last two points mitigate the main issues with Pusher (vs websockets), we'll plan to stick with Pusher after all for now, so I'll revive the work to replace the Pusher Rust library and provide TLS support. @pbiggar @IanConnolly

I believe implementing the above decisions mostly impacts my changes to the backend. It still needs changes to route messages to the right canvas, and to swap out the Pusher credentials for a Dark-owned Pusher account rather than my free test account, but I plan to address those in future PRs.
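For illustration, the reduced notification might look something like this - field names and types are guesses, not anything we decided:

```rust
// Hypothetical shape of the "new trace available" notification: just enough
// for the editor to know what to re-fetch via the existing pull path.
use serde::Serialize;

#[derive(Serialize)]
struct TraceNotification {
    trace_id: String, // id of the newly available trace
    tlids: Vec<u64>,  // toplevels the trace touches
}
```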
Yep! 🚢 @samstokes
This is a functioning sketch of how the stroller logic will look. The rough idea is that a `POST` request to stroller at `/canvas/:canvas_uuid/trace/:trace_uuid/events` causes stroller to fetch the latest event and push it to the editor.
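As a purely illustrative sketch (not the code in this PR) of extracting the ids from that path:

```rust
// Illustrative path matcher for /canvas/:canvas_uuid/trace/:trace_uuid/events.
fn parse_events_path(path: &str) -> Option<(&str, &str)> {
    let parts: Vec<&str> = path.trim_matches('/').split('/').collect();
    match parts.as_slice() {
        ["canvas", canvas_uuid, "trace", trace_uuid, "events"] => {
            Some((canvas_uuid, trace_uuid))
        }
        _ => None,
    }
}
```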
In the interests of getting to the interesting part, this PR takes a few steps at once, including handling the `POST` requests described above.

Still to do
Besides the obvious (actually making the backend call stroller, and the editor subscribe to Pusher), a few things are missing or could be improved. Given that nothing calls this yet, though, I propose they can wait until subsequent PRs.
`pusher-http-rust` doesn't support TLS (! - see below). I'm guessing this would be a deal breaker for production use, since it would mean sending end-user data in plaintext across the public Internet.

Pusher client
This gets its own section because it's a mess. I started off by using the `pusher` crate - it's under the `pusher-community` GitHub org but seemed at least semi-official. Unfortunately it turns out to be basically unmaintained - the last substantial commit is over 18 months old, and it depends on some old libraries. Among other issues, it depends on `hyper`, the same HTTP crate I'm using for the stroller web server, but an older version that relies on a pre-release version of `tokio`; that `hyper` is too old to be compatible with `hyper-tls`, which seems to be the more supported of the two `hyper` plugins adding TLS support (hence the unencrypted connection noted above).

I briefly looked into forking it to bring it up to date with current `hyper`, but the `hyper` API changes are significant and that didn't look like a good investment of time.

Instead we should probably just ditch this crate and write our own wrapper for the Pusher HTTP API, which is well documented and not very complicated. Unless we decide this is reason enough to ditch Pusher and deal with long-lived HTTP connections directly to stroller, I plan to work on that next.