For today's Skunkworks project, I'm going to show how to continuously tail a materialized view and stream the results directly to one or more users' browsers.
The rough dataflow will look like:
source -> materialized <- tornado <- clients
I was able to create an end-to-end, event-driven streaming pipeline using materialized views. The final workflow ended up looking more like this:
source -> file <- materialized <- tail | post -> tornado <- clients
Basically, we have a script that reads data from a remote source and writes the contents to a file. That file is tailed by materialized, which uses the results to keep materialized views up to date. This is exactly as described in the demo docs.
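To make that first hop concrete, here's a minimal sketch of the kind of ingest script involved; the stream URL and output path are assumptions for illustration, not necessarily what the demo uses:

```python
# Sketch of the "source -> file" hop: append the Wikimedia RecentChanges
# server-sent-event stream to a flat file for materialized to tail.
# STREAM_URL and OUT_PATH are illustrative assumptions.
import requests

STREAM_URL = "https://stream.wikimedia.org/v2/stream/recentchange"
OUT_PATH = "/tmp/wikirecent"

with requests.get(STREAM_URL, stream=True) as resp, open(OUT_PATH, "a") as out:
    for line in resp.iter_lines(decode_unicode=True):
        # SSE data lines look like "data: {...json...}"; keep just the JSON.
        if line and line.startswith("data: "):
            out.write(line[len("data: "):] + "\n")
            out.flush()
```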
The parts I had to figure out today were the following:
First, standing up Materialize itself. I'd never actually used it before; pleasantly, this took very little time and worked as expected.
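A minimal sketch of that setup, assuming the file-source syntax from the demo docs of the time (a tailed `FORMAT TEXT` source) and an illustrative `top10` view over the edit stream:

```python
# Create a tailed file source and a materialized view over it. materialized
# speaks the Postgres wire protocol (port 6875 by default), so plain
# psycopg2 works. The source and view names here are illustrative.
import psycopg2

conn = psycopg2.connect(host="localhost", port=6875,
                        dbname="materialize", user="materialize")
conn.autocommit = True
with conn.cursor() as cur:
    # Ingest the file our source script is appending to, line by line.
    cur.execute("""
        CREATE SOURCE wikirecent
        FROM FILE '/tmp/wikirecent' WITH (tail = true)
        FORMAT TEXT
    """)
    # Maintain a live top-10 of editors by edit count.
    cur.execute("""
        CREATE MATERIALIZED VIEW top10 AS
        SELECT data->>'user' AS editor, count(*) AS edits
        FROM (SELECT text::jsonb AS data FROM wikirecent)
        GROUP BY editor
        ORDER BY edits DESC
        LIMIT 10
    """)
```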
Second, creating two scripts to read the results of the tail and pipe that data into Tornado. I ended up needing two scripts, as I could not find a fast way to implement a non-blocking method for using psycopg2's `copy_expert`, and ended up resorting to Unix pipes. So, there is one script that reads the result of the tail and writes it to stdout. The second script reads from stdin, parses the Materialize tail output, converts it to JSON, and POSTs the result to the Tornado server.
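Here's roughly what that pair looks like, with the caveat that the exact column layout of `TAIL` output varied across early Materialize versions, and the endpoint URL is a placeholder of my own. The first script just lets `copy_expert` block forever and writes whatever arrives to stdout:

```python
#!/usr/bin/env python3
"""tail_view.py -- stream TAIL output from materialized to stdout."""
import sys

import psycopg2

conn = psycopg2.connect(host="localhost", port=6875,
                        dbname="materialize", user="materialize")
with conn.cursor() as cur:
    # COPY (TAIL ...) TO STDOUT never terminates, so copy_expert blocks
    # forever -- which is fine, since this process is just the left-hand
    # side of a unix pipe.
    cur.copy_expert("COPY (TAIL top10) TO STDOUT", sys.stdout)
```

The second script reads those rows from stdin and POSTs them onward:

```python
#!/usr/bin/env python3
"""post_updates.py -- parse TAIL rows from stdin and POST them as JSON."""
import sys

import requests

# Placeholder endpoint on the Tornado server sketched below.
API_URL = "http://localhost:8888/api/v1/update"

for line in sys.stdin:
    # Each COPY row is tab-separated; assume a timestamp and diff column
    # followed by the view's own columns (the layout varies by version).
    timestamp, diff, *columns = line.rstrip("\n").split("\t")
    payload = {"timestamp": int(timestamp), "diff": int(diff), "row": columns}
    requests.post(API_URL, json=payload)
```

Wired together as `python3 tail_view.py | python3 post_updates.py`.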
Third, the server. This part was pretty straightforward: create a non-blocking Tornado server that queries the database using momoko as a non-blocking wrapper around psycopg2, accepts JSON blobs via POST requests, and broadcasts those blobs to every connected client.
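A compressed sketch of that server, assuming momoko's `Pool`/`connect` interface; the route and handler names are placeholders carried over from the scripts above:

```python
#!/usr/bin/env python3
"""server.py -- broadcast POSTed updates to websocket listeners."""
import momoko
import tornado.ioloop
import tornado.web

LISTENERS = set()  # websocket clients, registered by the stream handler


class UpdateHandler(tornado.web.RequestHandler):
    """Accept JSON blobs from the pipe script and fan them out."""

    def post(self):
        message = self.request.body.decode("utf-8")
        for client in list(LISTENERS):
            client.write_message(message)


if __name__ == "__main__":
    app = tornado.web.Application([
        (r"/api/v1/update", UpdateHandler),
        # (r"/view/stream", StreamHandler),  # websocket side, sketched later
    ])
    ioloop = tornado.ioloop.IOLoop.current()
    # momoko wraps psycopg2 so view queries don't block the event loop.
    app.db = momoko.Pool(
        dsn="dbname=materialize user=materialize host=localhost port=6875",
        ioloop=ioloop,
    )
    ioloop.run_sync(app.db.connect)
    app.listen(8888)
    ioloop.start()
```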
Finally, the browser side: a pretty simple little bit of code to open a websocket, read the results, and update HTML in response. The top 10 editors even get a nice little bar chart that auto-updates in all sorts of funny ways.
I'd love to cut out most of the middlemen and simply expose changelogs (`tail`) over websockets from materialized. This would remove most of the complexity. It would also enable a much simpler "base + patches" stream of updates, as the client wouldn't need to figure out how to synchronize the view and changelog.
For the record, here's the running task list from the day:

- Spin up materialized and ingest wikipedia data
- Write script to create interesting materialized views
- Figure out how to tail messages from the console
- Get tail working in vanilla Python
- Spin up tornado to splat data onto the screen
- What's a good dataset to show this off?
- Come up with a few interesting materialized views.
- Pipe output from `copy_expert` into a program that posts messages to tornado
- Write some javascript to receive these messages
- Write some javascript to update HTML on these messages!
- Perhaps visualize the results of the various views?
- Implement a sane update mechanism for the visualization
- Stretch Goal -- Stream Materialize internal tables to browser too
- Can we find two datasets that are good to show off joins?
- Figure out how to use tail with Momoko or Tornado
- Accept an incoming connection request, `GET /view/stream`
- Insert client socket onto list of listeners for `delta` messages
- Query current view to produce a `diff` message (see the sketch below).
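Concretely, that per-connection flow might look like the following websocket handler, building on the server sketch above; it sends the initial snapshot as the `base` message described in the next section, and assumes momoko's `execute` returning a cursor future:

```python
import json

import tornado.ioloop
import tornado.websocket

LISTENERS = set()  # in practice, shared with the POST broadcast handler


class StreamHandler(tornado.websocket.WebSocketHandler):
    """GET /view/stream: register the client, then send its base snapshot."""

    def open(self):
        # 1. Accept the connection and subscribe it to future deltas.
        LISTENERS.add(self)
        # 2. Kick off the base query without blocking the event loop.
        tornado.ioloop.IOLoop.current().spawn_callback(self.send_base)

    async def send_base(self):
        # self.application.db is the momoko pool created at startup.
        cursor = await self.application.db.execute("SELECT * FROM top10")
        self.write_message(json.dumps({"type": "base",
                                       "rows": cursor.fetchall()}))

    def on_close(self):
        LISTENERS.discard(self)
```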
The client will receive a stream of messages from the server. There will be two types of messages: `base` and `delta`.
A `base` message contains the results of the view at a given point in time, queried upon first connection by an individual client. The `delta` message is an array of inserts and deletes that should be applied continuously to the base to keep the dataset in sync.
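The exact wire format isn't pinned down here, so the field names below are illustrative only; just the base/delta split comes from the design above:

```python
# Hypothetical message shapes.
base = {
    "type": "base",
    "rows": [["editor_a", 42], ["editor_b", 17]],  # full view contents
}
delta = {
    "type": "delta",
    "updates": [
        (+1, ["editor_a", 43]),  # diff +1: insert this row
        (-1, ["editor_a", 42]),  # diff -1: delete this row
    ],
}
```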
Clients should only ever see a single `base` message and should expect to see a never-ending stream of `delta` messages.
On initial load, the client may start seeing `delta` messages before it sees a `base` message. This is to ensure that the client sees all messages required to keep the dataset in sync. Clients should buffer, and optionally compact, all `delta` messages until the `base` message arrives.
`delta` messages are considered idempotent. It's possible that applying a `delta` message results in no changes, as the `base` message query may already have the `delta` applied within the database.
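Putting these rules together, the client-side bookkeeping is small enough to sketch. It's shown here in Python rather than the browser's JavaScript for brevity, holding the view as a multiset of rows so that an already-applied delete bottoms out as a no-op:

```python
from collections import Counter


class ViewMirror:
    """Client-side copy of a view, kept in sync per the rules above."""

    def __init__(self):
        self.rows = Counter()  # multiset of rows, keyed as tuples
        self.have_base = False
        self.pending = []      # deltas buffered until the base arrives

    def on_base(self, rows):
        # Exactly one base message is expected per connection.
        self.rows = Counter(tuple(r) for r in rows)
        self.have_base = True
        # Replay (optionally first compacting) deltas that arrived early.
        for updates in self.pending:
            self.apply(updates)
        self.pending.clear()

    def on_delta(self, updates):
        if self.have_base:
            self.apply(updates)
        else:
            self.pending.append(updates)

    def apply(self, updates):
        for diff, row in updates:
            key = tuple(row)
            self.rows[key] += diff
            # A delete for a row the base never had (or already dropped)
            # bottoms out at zero, making the delta a no-op.
            if self.rows[key] <= 0:
                del self.rows[key]
```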