Skip to content

Latest commit

 

History

History
265 lines (234 loc) · 13.3 KB

2023-01-04.md

File metadata and controls

265 lines (234 loc) · 13.3 KB

Advent of Code 2022: In SQL

I was recently pointed at 2022's Advent of Code day 16, on account of it appears to be a problem that lends itself to dynamic programming. We just had a blog post about Materialize's WITH MUTUALLY RECURSIVE, which is meant to make task of the dynamic programming nature relatively easier.

So I set out to see whether it would be easy or hard or somewhere in between to implement this in SQL with WITH MUTUALLY RECURSIVE.

The problem

If you click on the link up there and sign in, you'll get a better explanation of the problem than I'll be able to give you. The tl;dr is that you have a directed graph whose nodes have valves which release pressure when opened. Your job given a fixed budget of 30 actions is to walk from node to node (one action) and open valves (one action), such that by the end of 30 actions the largest total amount of pressure has been released. The catch is that the pressure continues to be released each action, so when you open a valve you get points for each remaining round, rather than a fixed reward.

This problem is most likely NP-hard, as it seems a short homework assignment to show that Hamilton path reduces to the unit-pressure case of this problem. Given that it is NP-hard, we have to make some guesses about what an efficient implementation might look like. I recommend looking at the example problems; if they look anything like mine most of the graph nodes have valves with zero pressure to be released (only 15 of my nodes did not).

The data are presented as lines like so (the example from the link)

Valve AA has flow rate=0   tunnels lead to valves DD, II, BB
Valve BB has flow rate=13  tunnels lead to valves CC, AA
Valve CC has flow rate=2   tunnels lead to valves DD, BB
Valve DD has flow rate=20  tunnels lead to valves CC, AA, EE
Valve EE has flow rate=3   tunnels lead to valves FF, DD
Valve FF has flow rate=0   tunnels lead to valves EE, GG
Valve GG has flow rate=0   tunnels lead to valves FF, HH
Valve HH has flow rate=22  tunnel leads to valve GG
Valve II has flow rate=0   tunnels lead to valves AA, JJ
Valve JJ has flow rate=21  tunnel leads to valve II

From this, you might (rightly) conclude that we'll need something like

CREATE TABLE valves (name text, rate int);
CREATE TABLE steps (src text, dst text);

All we have to do is figure out how to get the answer!

A first solution

There are a few solutions here, and I went through several of decreasing complexity. Most of the complexity was due to me finding better SQL functions to avoid complex queries

I did get several hints though!

  1. I got pointed here with the hint that dynamic programming makes sense.
  2. The internet revealed a smart path-compression trick.

The main thing we are going to write is a (recursive) query to determine the highest score you could have in each "configuration" of the game. The configurations are, I think, pretty much determined by where you currently are, which valves are still available (closed), and how many steps you have left. It doesn't really matter what steps you took to get to this configuration, other than the score you arrive with, because your future options are only determined by these three things.

Based on this, we'll write a recursive query to track the best score you could have in each state. We'll also leave some breadcrumbs behind (old_pos) to let us find the actual path at the end.

WITH MUTUALLY RECURSIVE 
    -- Track a best score for each position, still closed valves, and remaining steps.
    -- Also track the previous position we came from, to find the solution (not just the score).
    state(pos text, closed text[], steps_left int, score int, old_pos text) AS (
        -- What could possibly go here????
    )
SELECT * FROM state;

The most natural thing from my point of view is to define state as an aggregation with keys pos, closed, and steps_left, retaining the maximum score and whatever old_pos that corresponds to.

WITH MUTUALLY RECURSIVE 
    -- Track a best score for each position, still closed valves, and remaining steps.
    -- Also track the previous position we came from, to find the solution (not just the score).
    state(pos text, closed text[], steps_left int, score int, old_pos text) AS (
        -- `DISTINCT ON` is postgresql magic that implements `argmax`.
        SELECT DISTINCT ON(pos, closed, steps_left) pos, closed, steps_left, score, old_pos
        FROM (
            -- What could possibly go here????
        )
        ORDER BY pos, closed, steps_left, score DESC, old_pos
    )
SELECT * FROM state;

We now need to think about which sort of options we should provide when determining the best score for each (pos, closed, steps_left). There is going to be a beefy bit, but for sure one option you'll always have is to start from the initial state with all valves closed, 30 steps remaining, and zero points (we stash that in the view initial_state).

WITH MUTUALLY RECURSIVE 
    -- Track a best score for each position, still closed valves, and remaining steps.
    -- Also track the previous position we came from, to find the solution (not just the score).
    state(pos text, closed text[], steps_left int, score int, old_pos text) AS (
        -- `DISTINCT ON` is postgresql magic that implements `argmax` from math.
        SELECT DISTINCT ON(pos, closed, steps_left) pos, closed, steps_left, score, old_pos
        FROM (
            SELECT * FROM initial_state
            UNION ALL
            -- What could possibly go here????

        )
        ORDER BY pos, closed, steps_left, score DESC, old_pos
    )
SELECT * FROM state;

This is the moment where you, like me, might start to bonk your head into the various ways to describe the recursive rule: From any existing state, can either open a valve (and score points) or walk to an adjacent position. In the first case I update closed and score. In the second case I update pos. In each case I decrement steps_left by one. That's the heart of it.

Don't forget that you can only open valves that are not yet closed! I didn't forget this, but certainly did not know how to check this without some gory subquery based on unnest(closed). SQL does not always have the most ergonomic ways to write things that are easy in Rust. As it turns out, I got a bit lucky and Materialize implements a surprising function array_remove that saved my bacon here. Watch for it in the full answer!

A second solution

What was most helpful was the internet observing that while you have all these steps that allow you to take one step here and there, you could also get the shortest path length between every location. And if you are really smart (not me; I stole this too) you notice that you only care to go to locations that have positive rate.

This is a pretty standard recursive query to write.

-- Paths from `src` to `dst` with a `len` and destination `rate`.
CREATE MATERIALIZED VIEW paths AS
WITH MUTUALLY RECURSIVE
    -- For each `(src, dst)` pair, the shortest `len` path.
    paths (src text, dst text, len int) AS (
        SELECT old_pos, new_pos, min(len)
        FROM (
            SELECT old_pos, new_pos, 1 as len
            FROM transitions
            UNION ALL
            SELECT src, new_pos, len + 1
            FROM paths, transitions
            WHERE paths.dst = transitions.old_pos
        )
        GROUP BY old_pos, new_pos
    )
-- Retain destinations with non-zero pressure.
SELECT src, dst, len, rate
FROM paths, valves
WHERE paths.dst = valves.pos
AND valves.rate > 0;

Well this is great. We now have a relation paths that tells us how long it takes to reach each valve, and what rate we'll find when we arrive. All we need now is to complete the transition rule up above:

WITH MUTUALLY RECURSIVE
    -- Keep the best option for each (pos, closed, steps) configuration.
    state(pos text, closed text[], steps_left int, score int, old_pos text) AS (
        SELECT DISTINCT ON (pos, closed, steps_left) pos, closed, steps_left, score, old_pos
        FROM (
            -- Can always start from the initial state ..
            SELECT * FROM initial_state
            UNION ALL
            -- .. or follow a path to a closed valve and open it.
            SELECT 
                paths.dst, 
                array_remove(closed, paths.dst), 
                steps_left - paths.len - 1, 
                score + (steps_left - paths.len - 1) * rate, 
                paths.src
            FROM state, paths
            WHERE paths.src = state.pos
              -- only take if enough steps remain to score.
              AND steps_left - paths.len > 1
              -- only take if valve is still closed.
              AND NOT array_remove(closed, paths.dst) = closed
        )
        ORDER BY pos, closed, steps_left, score DESC, old_pos
    )
-- Present several of the best scoring states.
SELECT * FROM state ORDER BY score DESC, steps_left DESC LIMIT 10;

I'm using a local build of Materialize, where this completes in ..

 pos |         closed         | steps_left | score | old_pos 
-----+------------------------+------------+-------+---------
 PL  | {DW,EW,MJ,RU,WJ,YJ,ZM} |          4 |  2250 | UD
 WJ  | {CO,DW,EW,MJ,RU,YJ,ZM} |          3 |  2241 | PL
 EW  | {CO,DW,MJ,RU,WJ,YJ,ZM} |          3 |  2214 | PL
 EW  | {DW,MJ,PL,RU,WJ,YJ,ZM} |          3 |  2210 | UD
 WJ  | {DW,EW,FD,MJ,RU,YJ,ZM} |          3 |  2202 | PL
 UD  | {DW,EW,MJ,PL,RU,WJ,YJ} |          2 |  2192 | UU
 WJ  | {DW,EW,MJ,PL,RU,YJ,ZM} |          1 |  2187 | UD
 YJ  | {DW,EW,MJ,PL,RU,UD,WJ} |          2 |  2184 | UU
 UD  | {DW,EW,MJ,RU,WJ,YJ,ZM} |          2 |  2177 | PL
 DW  | {CO,EW,MJ,RU,WJ,YJ,ZM} |          1 |  2176 | PL
(10 rows)

Time: 4928.808 ms (00:04.929)
materialize=> 

Fortunately, 2250 was the correct answer for me. This was after only a bit of flailing while I failed to recall that you need DESC to get the largest thing out of ORDER BY, rather than smallest.

A third solution

It turns out that a lot of the time is spent in the SELECT DISTINCT ON logic. Materialize has a robust implementation of this that can tolerate billions of elements with the same key, using a hierarchical reduction, and .. it's all bit much for this data. Instead, we can just record the MAX(score) without old_pos, and use a hint to skip the robust planning:

    -- Faster version that doesn't leave breadcrumbs behind. 
    WITH MUTUALLY RECURSIVE
        -- Keep the best score for each (pos, closed, steps) configuration.
        state(pos text, closed text[], steps_left int, score int, old_pos text) AS (
            SELECT pos, closed, steps_left, MAX(score), 'N/A'
            FROM (
                -- Can always start from the initial state ..
                SELECT * FROM initial_state
                UNION ALL
                -- .. or follow a path to a closed valve and open it.
                SELECT 
                    paths.dst, 
                    array_remove(closed, paths.dst), 
                    steps_left - paths.len - 1, 
                    score + (steps_left - paths.len - 1) * pressure, 
                    paths.src
                FROM state, paths
                WHERE paths.src = state.pos
                  -- only take if enough steps remain to score.
                  AND steps_left - paths.len > 1
                  -- only take if valve is still closed.
                  AND NOT array_remove(closed, paths.dst) = closed
            )
            GROUP BY pos, closed, steps_left
            OPTIONS (EXPECTED GROUP SIZE = 1)
        )
    SELECT * FROM state ORDER BY score DESC, steps_left DESC LIMIT 10;

Rather than SELECT DISTINCT ON we are doing a GROUP BY with an OPTIONS clause. That clause is what tells Materialize to skip the robustness and imagine that groups will be small. With that, you get what I hope is the same anwser, minus the usefulness of the old_pos column. But, the time goes down to just over a second, so that's pretty great!

 pos |         closed         | steps_left | score | old_pos 
-----+------------------------+------------+-------+---------
 PL  | {DW,EW,MJ,RU,WJ,YJ,ZM} |          4 |  2250 | N/A
 WJ  | {CO,DW,EW,MJ,RU,YJ,ZM} |          3 |  2241 | N/A
 EW  | {CO,DW,MJ,RU,WJ,YJ,ZM} |          3 |  2214 | N/A
 EW  | {DW,MJ,PL,RU,WJ,YJ,ZM} |          3 |  2210 | N/A
 WJ  | {DW,EW,FD,MJ,RU,YJ,ZM} |          3 |  2202 | N/A
 UD  | {DW,EW,MJ,PL,RU,WJ,YJ} |          2 |  2192 | N/A
 WJ  | {DW,EW,MJ,PL,RU,YJ,ZM} |          1 |  2187 | N/A
 YJ  | {DW,EW,MJ,PL,RU,UD,WJ} |          2 |  2184 | N/A
 UD  | {DW,EW,MJ,RU,WJ,YJ,ZM} |          2 |  2177 | N/A
 DW  | {CO,EW,MJ,RU,WJ,YJ,ZM} |          1 |  2176 | N/A
(10 rows)

Time: 1332.340 ms (00:01.332)
materialize=> 

Thoughts

WITH MUTUALLY RECURSIVE is obviously an unstoppable force of computer science that solved the first half of one of the Advent of code problems! But actually, the resulting SQL isn't all that complicated, and the resulting performance isn't especially grim either. I'll go as far as saying that it is pretty fair!

If any of you SQL folks were feeling left out of stuff like Advent of Code on account of the pain of writing these sorts of queries, rejoice! It's still up for a while (I think?) and you can go and stretch your limbs and see what you can cook up with just SQL and Materialize.