I set a goal for myself to solve this year's Advent of Code (2023) in under a second using Rust. This has been done by other people in prior years, and I know there were others out there this year who were similarly optimizing their solutions. In particular, I found some like-minded developers in Chris Biscardi's Discord server who were also optimizing AOC2023 using Rust. Many thanks to those who shared their ideas!

In the end, I managed to solve all of 2023 in just under 31ms (solutions benchmarked separately and then summed together). The two main challenges along the way were finding the right algorithms and data structures, and then optimizing my implementations. The latter was more important than I realized up front; I was often able to achieve a 5x or better speed improvement with no algorithmic changes by optimizing the implementation. I'll discuss some of the general optimization techniques first, then highlight some of the algorithmic choices for individual days.

Ground Rules

  1. 100% safe Rust.¹ Part of my goal was to learn optimization techniques I could apply in day-to-day usage. On certain days, unsafe memory access could have been used to skip bounds checking, but that is not an optimization I would use on anything but the most performance-sensitive code, and even there I would think twice before reaching for unsafe.
  2. Solutions should work for any user's input. Some days (8, 20, and 21) involved some reverse-engineering and guessing at valid assumptions about the input, so this rule was more of a guideline, but when possible I tried not to make undue assumptions about my input.

General Optimization Tips

  1. Memory is slow. Reducing allocations by reusing buffers or reorganizing data structures can make a huge difference. I used DHAT to profile allocations and find opportunities for improvements on many days.
  2. Memory is really slow. Some data structures may have better asymptotic performance, but often a plain array (or Vec) offers better practical performance, due to cache locality and other similar reasons.
  3. Hash sets and maps may have O(1) insertion and retrieval, but the hashing operations are expensive, so the constant factors can be quite high. If the number of elements is small, a Vec can outperform a HashSet even if you have to scan every element on insertion and retrieval. Likewise, a HashSet<usize> can be replaced with a Vec<bool> if the maximum value being inserted is reasonably small (see the sketch after this list). Swapping hash sets/maps for vectors was often a 2-3x speedup. When these optimizations were not possible (for example, because the domain of the items being hashed was too large), I found that fx-hash (a drop-in replacement for the built-in hash sets/maps) offered large performance gains.
  4. String operations in safe Rust have guardrails preventing you from shooting yourself in the foot on utf-8 inputs (not all characters in utf-8 are 1 byte wide!). However, I assumed that in AoC, all inputs would be encoded in ASCII. For problems where parsing was the bottleneck, operating on &[u8] instead of &str offered modest performance improvements (at the cost of ergonomics).
  5. Reduce instructions in the hot path. For instance, some days involved search algorithms where either DFS or BFS could be used interchangeably; on these days a queue (VecDeque) for BFS involves more work than a stack (Vec), as the queue needs to handle the complexity of wrapping around in the underlying vector. Perhaps BFS is a more natural choice, but occasionally changing to DFS to use a Vec instead of a VecDeque offered large gains. This kind of optimization is sometimes more art than science, as you need to keep in mind that the compiler is optimizing your code as well. I usually started with the simplest implementation and then benchmarked various ideas to improve on it. Some would pan out, others wouldn't.
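
As a concrete illustration of tip 3, here is a hedged sketch (not code from my solutions) of replacing a HashSet<usize> with a flat Vec<bool> when the values are known to be small; LIMIT is a made-up bound standing in for "reasonably small":

```rust
use std::collections::HashSet;

// Hypothetical upper bound on the values being inserted.
const LIMIT: usize = 1024;

// The hash-based version: every insert pays for hashing.
fn count_distinct_hashed(values: &[usize]) -> usize {
    let seen: HashSet<usize> = values.iter().copied().collect();
    seen.len()
}

// The flat version: a plain index into a boolean buffer, no hashing at all.
fn count_distinct_flat(values: &[usize]) -> usize {
    let mut seen = vec![false; LIMIT];
    let mut count = 0;
    for &v in values {
        if !seen[v] {
            seen[v] = true;
            count += 1;
        }
    }
    count
}
```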

Puzzle Summaries

Warning

Spoilers abound below. If you are still working on some of the puzzles and don't want to read spoilers, stop here!

I'll have a short commentary for each day's puzzle below, including what I think are some key insights to solve the puzzle quickly. The focus will be primarily on algorithmic choices, but if there was some important detail in the implementation I'll call that out too. I doubt that my solutions were optimal for every day. Better solutions almost certainly exist for some puzzles, such as day 10 where I hadn't heard of the algorithm that many people used to solve it. On other puzzles, there may have been multiple different approaches that all had similar performance characteristics. This isn't intended to be a comprehensive guide, but instead a brief summary of my solutions and approaches.

Part 1 was straightforward, but Part 2 was unexpectedly tricky for a day 1 puzzle. The potential for numbers to "overlap" in the input (such as "oneight") was the main challenge. Luckily, there is a linear approach that sidesteps the problem entirely: Keep a list of strings to match against (e.g. ["one", "two", ...]), and then for every character in the input, see if the substring starting at that character starts with one of those strings. Find the first and last match, and then you're done. Optimizations include early termination and searching from either end (to skip most of the middle of the string). While I didn't implement this, a trie or FST likely would have been even better for matching against the known set of strings.
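
As a rough sketch of that matching loop (assuming ASCII input, and ignoring the early-termination optimizations beyond stopping at the first match from each side):

```rust
const WORDS: [&str; 9] = [
    "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
];

// Return the digit (literal or spelled out) starting at byte offset `i`, if any.
fn digit_at(line: &str, i: usize) -> Option<u32> {
    let rest = &line[i..];
    if let Some(d) = rest.chars().next().and_then(|c| c.to_digit(10)) {
        return Some(d);
    }
    WORDS
        .iter()
        .position(|w| rest.starts_with(w))
        .map(|p| p as u32 + 1)
}

fn calibration_value(line: &str) -> u32 {
    let first = (0..line.len()).find_map(|i| digit_at(line, i)).unwrap_or(0);
    let last = (0..line.len()).rev().find_map(|i| digit_at(line, i)).unwrap_or(0);
    first * 10 + last
}
```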

This was mostly an exercise in parsing. I could have written a faster solution that skipped utf-8 validation and used byte arrays instead of strings, but I only started doing that in later days. The primary optimization I applied was solving each line as I went, eliminating the need for any memory allocations.

Another fairly straightforward puzzle (as expected in the early days). My solution scanned the input top to bottom and left to right looking for numbers. When a number was found and I encountered the first non-numeric character following the number (or the end of the row!), I looked at all the characters surrounding the number to find symbols. From there, the rest of the solution was straightforward. There was a possible optimization I didn't pursue on part 2 where I looked for gears instead and only scanned for numbers that touched the gears. This seemed like it would reduce the amount of work performed a fair amount, but also was more complicated and not needed for the ~40-50µs it may have saved.

Part 1 was another exercise in parsing, but part 2 was a nice twist. I ended up keeping a vector that denoted how many copies of each scratch card existed. If scratch card i had n winning numbers and c copies in the vector, then I would add c to elements i+1 through i+n of the vector. The answer was the sum of each entry in that vector at the end.
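
In sketch form (with wins[i] standing in for the already-parsed count of winning numbers on card i):

```rust
fn total_cards(wins: &[usize]) -> u64 {
    // Every card starts with a single copy.
    let mut copies = vec![1u64; wins.len()];
    for i in 0..wins.len() {
        // Card i adds its copy count to the next wins[i] cards.
        for j in i + 1..(i + 1 + wins[i]).min(wins.len()) {
            copies[j] += copies[i];
        }
    }
    copies.iter().sum()
}
```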

This was the first "hard" day of the year. Brute force was reasonable for part 1, but too slow² for part 2. The key to part 2 was to operate on ranges directly, rather than keeping track of individual seeds. Each pair of numbers in the seed input would result in a range of seeds. Then each range within a map applied to a seed range could result in a maximum of 3 output seed ranges (some portion of the range that comes before the mapping, the mapped portion, and whatever is left afterwards). Because of this, the total number of seed ranges after applying all the mappings is linear in the size of the input. My solution keeps track of a vector of seed ranges (starting with a single element), and applies mappings one by one to expand the vector of seed ranges. At the end, I can take the minimum "start" value of the ranges.
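
The core of the range-on-range operation might look something like this hedged sketch (half-open ranges; delta is the mapping's destination start minus its source start):

```rust
#[derive(Clone, Copy)]
struct Range {
    start: u64,
    end: u64, // exclusive
}

// Apply a single mapping rule (source range `src`, offset `delta`) to one seed
// range, returning the mapped piece (if any) plus the unmapped leftovers.
fn apply_rule(seed: Range, src: Range, delta: i64) -> (Option<Range>, Vec<Range>) {
    let lo = seed.start.max(src.start);
    let hi = seed.end.min(src.end);
    if lo >= hi {
        return (None, vec![seed]); // no overlap: the range passes through untouched
    }
    let mapped = Range {
        start: (lo as i64 + delta) as u64,
        end: (hi as i64 + delta) as u64,
    };
    let mut rest = Vec::new();
    if seed.start < lo {
        rest.push(Range { start: seed.start, end: lo }); // piece before the mapping
    }
    if hi < seed.end {
        rest.push(Range { start: hi, end: seed.end }); // piece after the mapping
    }
    (Some(mapped), rest)
}
```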

An interesting puzzle with a constant-time mathematical solution. Solving the equations of motion for the boat and setting up an inequality to beat the record distance results in a quadratic equation as a function of "charge time". The quadratic formula provides the answer after that.
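
In other words, holding the button for $h$ milliseconds in a race lasting $t$ milliseconds travels $h(t - h)$, and beating the record $d$ means $h^2 - th + d < 0$. A hedged sketch of counting the integer hold times strictly between the two roots:

```rust
fn ways_to_win(time: f64, record: f64) -> u64 {
    let disc = (time * time - 4.0 * record).sqrt();
    // Nudge the roots inward so exactly tying the record doesn't count as a win.
    let lo = ((time - disc) / 2.0 + 1e-9).ceil() as u64;
    let hi = ((time + disc) / 2.0 - 1e-9).floor() as u64;
    hi - lo + 1
}
```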

Nothing fancy for day 7, but getting a working solution involved sidestepping some potential pitfalls from edge cases. One implementation detail that I used to optimize the solution was recognizing that a hand could only contain 5 cards, so there was no need for a dynamic vector. Instead, a fixed-size array (which in Rust can be stack allocated) was sufficient. Thus, the only allocation was collecting all the hands into a vector to sort them.

The first "reverse engineering of the input" day! The solution hinged on recognizing that every ghost would eventually enter a cycle (and if a solution existed, their cycle would contain one of the destinations³). Most people (including myself) assumed that the length of the cycle was the same as the number of steps to reach the cycle, and took the LCM of the latter values. A more general solution (which I later coded up as "part 3") would allow the starting offsets to differ from the cycle length (requiring the Chinese Remainder Theorem to solve). The most general solution (which I didn't bother with) would allow for multiple destinations within a ghost's cycle. This would result in each ghost having multiple possible starting offsets and a single cycle length. Trying all permutations of the starting offsets seems like it would work, but a better general solution may exist.
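
For the common case, the answer is just the LCM of each ghost's cycle length, something like:

```rust
fn gcd(a: u64, b: u64) -> u64 {
    if b == 0 { a } else { gcd(b, a % b) }
}

fn lcm(a: u64, b: u64) -> u64 {
    a / gcd(a, b) * b
}

// Combine the per-ghost cycle lengths (found elsewhere) into the answer,
// assuming the offset to reach each cycle equals its length.
fn steps_until_all_finish(cycle_lengths: &[u64]) -> u64 {
    cycle_lengths.iter().copied().fold(1, lcm)
}
```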

It seemed like there might have been some mathematical approach to this puzzle, but I simply calculated the answer directly as per the instructions. I calculated the differences in place to avoid unnecessary allocations, and recognized that the answer was simply the sum of the final difference for each iteration (which I could keep a running total for as I went). Part 2 was the same, but with the numbers reversed. Slapping on a .rev() did the trick!

Part 1 was a slog of parsing and special cases, but nothing overly difficult. I made my life easier by inferring what pipe the starting position must have been, making it trivial to walk along the loop and count the steps until I returned to the start. Part 2 was a larger challenge. I later found out that something called the Shoelace formula exists, which can easily calculate the area of a closed polygon (such as the loop in this puzzle), so long as you are careful about self-overlaps. At the time, though, I had never heard of it, so I implemented flood fill. It was a giant mess, but I was careful about keeping track of the "inside" region so as to make the flood fill efficient. Later I saw someone on Reddit talking about line scanning, and it clicked: the cell one column to the left of the first column is always in the exterior region, and every time you cross a pipe you flip from outside to inside (and vice versa). You can scan row by row, column by column, and add up all the cells that are inside the pipe loop by this method. I tried that, got the wrong answer, and then sat down with pen and paper until I realized that you have to exclude pipes turning in one of the two north/south directions for it all to work out. The result was much shorter (and over 2x faster, to boot)!
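
A hedged sketch of that scan (assuming the grid has been cleaned so non-loop cells are '.' and the start has been replaced by its inferred pipe); counting only the north-connecting pipes ('|', 'L', 'J') as crossings is the pen-and-paper fix described above:

```rust
fn interior_cells(grid: &[Vec<u8>]) -> usize {
    let mut enclosed = 0;
    for row in grid {
        let mut inside = false; // one column left of the grid is always outside
        for &cell in row {
            match cell {
                b'|' | b'L' | b'J' => inside = !inside, // crossed the loop
                b'.' if inside => enclosed += 1,        // ground tile inside the loop
                _ => {}
            }
        }
    }
    enclosed
}
```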

Another day where part 1 is straightforward, but the numbers are "too large" to brute force part 2. My solution kept track of how many galaxies were in each row and column, then walked through the rows (and likewise for the columns) to produce row and column coordinates for each galaxy, adding a running offset as I went to account for the expansion. The trick I applied beyond this was using a linear solution to find the sum of distances rather than quadratic, after applying some algebra and iterating over the galaxies in a careful order to make sure that the coordinates were sorted implicitly.
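
The linear sum-of-distances trick along one axis looks roughly like this (given coordinates already in sorted order, as described):

```rust
// Sum of |x_i - x_j| over all pairs, assuming `sorted` is ascending.
fn axis_distance_sum(sorted: &[u64]) -> u64 {
    let mut total = 0;
    let mut prefix = 0; // sum of the coordinates seen so far
    for (i, &x) in sorted.iter().enumerate() {
        // x pairs with each of the i earlier galaxies: add x once per pair,
        // and subtract each earlier coordinate once.
        total += x * i as u64 - prefix;
        prefix += x;
    }
    total
}
```

Run it once for rows and once for columns and add the results.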

Brute force worked for part 1, but dynamic programming is required for part 2. It was a particularly tricky problem to solve, with a complex recurrence relation. The natural solution involved keeping track of the length of the current run of damaged springs when recursing, but this increased the state space (and reduced the effectiveness of memoization). My solution involved changing the recurrence to implicitly track whether runs of damaged springs matched the required counts as well as searching bottom-up iteratively instead of top-down recursively.

I didn't use any special tricks for this puzzle. Part 2 was very similar to part 1, except it involved finding a mirroring where the "error count" is 1 instead of 0. I performed the search in a way that would terminate early once a second error was found (in Rust, this was an iterator that counted column errors for a row split, where I used `.take(2).count() == 1`; hitting a second error let me fail immediately).
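
A hedged version of that early-exit counting might look like:

```rust
// Count mismatched cells across a horizontal mirror placed after row `split`,
// but stop scanning as soon as a second mismatch appears.
fn mismatches_at(grid: &[Vec<u8>], split: usize) -> usize {
    (0..split)
        .rev()
        .zip(split..grid.len())
        .flat_map(|(above, below)| {
            grid[above].iter().zip(&grid[below]).filter(|(a, b)| a != b)
        })
        .take(2)
        .count()
}
// Part 1 wants a split where this returns 0; part 2 wants exactly 1.
```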

Another "find the cycle" type of puzzle. I optimized the operation of a cycle of north/west/south/east tilts by precomputing how far each cell is from a cubed rock in each direction. I kept a vector of the coordinates of the round rocks, and on each tilt I walked through the list of rocks to figure out where each would end up. This involved looking up the location of the closest cubed rock in the direction of the tilt, and then looking up how many round rocks were already stacked on top of that cubed rock. Using that information, you can directly compute where the current round rock will end up. This approach lent itself well to computing the load. One possibly shady optimization I made was using the last 4 load values as the cache key (since I could pack them into a u128 integer). In general, this is likely not sound (it's easy to construct a counter-example where it will fail), but it also seemed like it would work for everyone's input. The alternative would be to cache on the current state, but that would require either keeping the rocks sorted or otherwise consistently serializing the state (or changing the approach to use a dense grid instead of sparse). It was about 5ms slower to serialize the state to a string and cache on that instead.

Nothing special algorithmic for this puzzle, but one notable optimization was the usage of a vector instead of a linked list for each "box". Yes, linked lists have O(1) removal, and vectors have O(N) removal, but vectors are still faster here. I initially wrote a linked list backed by a vector, but when I switched to a plain vector it was 2x faster.

Part 2 was just a loop around Part 1, so almost all of the optimization efforts were around making Part 1 faster. The exception to this was adding parallelism to the loop in Part 2, which in Rust is very easy with Rayon. This was the first day where Rayon offered significant performance gains; with a single thread, part 2 took 8ms, whereas with Rayon it took 1.7ms. My first attempt took 100ms, though, so there were plenty of optimizations available. Algorithmically, the only optimization was recognizing that entering an energized splitter node guaranteed that you were on a cycle and could exit, removing the need to explicitly track cycles (2x speed improvement). Using a 2D grid instead of a hash set was another 3x improvement, and the rest was a lot of simplifying and streamlining (and some inlining or unrolling of logic) to reduce duplicated or unnecessary work.

Day 17 was a rather fun path-finding puzzle. The primary obstacle to a speedy solution is representing the graph in a way that makes the search efficient. The most obvious way (to me) to represent the graph is to have a vertex represent not just the coordinates in the grid, but also the direction being traveled and how far in a straight line the crucible has already traveled. While this would work, the search space would be quite large. Instead of expanding the vertices in this manner, I decided to change how edges worked. I observed that each move could be viewed as a turn and then traveling straight for some distance (1-3 in part 1, and 4-10 in part 2). Each of these turn-and-go-straight moves became an edge, and the vertices were just the simple 2D coordinates. With this representation, Dijkstra's algorithm was rather straightforward to implement and worked great on both parts. One notable implementation optimization was the usage of a bucket queue instead of binary heap for the priority queue. This was nearly a 3x speed improvement, since as I had learned by now, vectors are just really fast. The other thing to note is that while it was only marginally faster in the end, I was determined to find an admissible heuristic that made A* faster than Dijkstra's, despite the problem seemingly being crafted to make A* not useful. I came up with a heuristic that properly accounted for the requirement to "zig-zag" for long straight distances, but I had to precalculate it for every position for A* to be faster in the end.
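
A bucket queue in this setting is barely more than an array of vectors indexed by distance; a hedged sketch:

```rust
// Priority queue for Dijkstra when distances are small non-negative integers:
// pop in increasing distance order straight out of per-distance buckets.
struct BucketQueue<T> {
    buckets: Vec<Vec<T>>,
    current: usize, // index of the smallest bucket that might be non-empty
}

impl<T> BucketQueue<T> {
    // `max_dist` is an upper bound on any total path cost we will ever push.
    fn new(max_dist: usize) -> Self {
        Self {
            buckets: (0..=max_dist).map(|_| Vec::new()).collect(),
            current: 0,
        }
    }

    fn push(&mut self, dist: usize, item: T) {
        self.buckets[dist].push(item);
        self.current = self.current.min(dist);
    }

    fn pop(&mut self) -> Option<(usize, T)> {
        while self.current < self.buckets.len() {
            if let Some(item) = self.buckets[self.current].pop() {
                return Some((self.current, item));
            }
            self.current += 1;
        }
        None
    }
}
```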

I was happy to be able to use the Shoelace formula that I learned about on Reddit after solving day 10. At first it seemed a little magical ("why should that even work?") but when I started trying alternate approaches to see if anything was faster, things started to click. At one point I threw out my shoelace solution and started solving from scratch with the perspective of integrating over the curve. I could see how if you treat the area under the curve as it moves right as positive, and negative as it moves left, that you would end up with the area of the closed curve in the end. And since all of our turns were right angles, the integral was as simple as summing up rectangles (which were easy to compute). Then when I went back to read Wikipedia's entry on the Shoelace formula, I realized that this was pretty much the trapezoid derivation. Neat! In the end, the "line integral" approach was marginally faster than the standard Shoelace formula, due to exploiting all the right angles.
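
For reference, the trapezoid form of the Shoelace formula for vertices $(x_1, y_1), \dots, (x_n, y_n)$ (with indices wrapping around) is:

$$A = \frac{1}{2} \left| \sum_{i=1}^{n} (y_i + y_{i+1})(x_i - x_{i+1}) \right|$$

With only axis-aligned edges, each term is literally the signed area of a rectangle, which is the "line integral" view described above.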

Day 19 was like day 5 cranked up to 11. That is, operating on ranges directly rather than "materializing" them was once again the key to a fast solution. The additional twist this time was that rather than a list of mappings to apply, the workflows formed a tree. I used DFS to search the tree to find the "volume" of all the 4D ranges that would be accepted and then calculated the total volume to find the answer.

Day 20 was the second "reverse engineering" puzzle that involved finding cycles and taking the LCM, this time relying on the cycle starting immediately. Except for this puzzle you also had to inspect the input and notice where to look for the cycles. Namely, assuming 4 conjunction nodes would lead into the sole conjunction node that leads into "rx". It all felt somewhat unsatisfying, but the actual implementation was quite fun. The only real optimization I used for this puzzle was to relabel the modules with sequential integers (in the order a label appeared in the input). This made the parsing slightly slower but paid off when calculating the solution as it turned slow hashing of strings into fast lookups in a vector. I used this technique for the first time on day 19 and would reuse it again on day 25.
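
The relabeling is just string interning; roughly:

```rust
use std::collections::HashMap;

// Map each label to a sequential id the first time it is seen during parsing,
// so the simulation can index into Vecs instead of hashing strings per pulse.
#[derive(Default)]
struct Interner<'a> {
    ids: HashMap<&'a str, usize>,
}

impl<'a> Interner<'a> {
    fn id(&mut self, label: &'a str) -> usize {
        let next = self.ids.len();
        *self.ids.entry(label).or_insert(next)
    }
}
```

After parsing, the module graph can live in something like a `Vec<Vec<usize>>` of destinations indexed by id.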

The final "reverse engineering" puzzle of the year. I initially spent some time attempting to come up with a generalized mathematical solution before I looked at the input and decided that the open central vertical/horizontal corridors were probably in everyone's input and we were supposed to use that to simplify the problem. My approach to the problem was geometrical. First I noticed that each cell in a grid had an associated "parity"; it was possible to land on that square in either an even number of steps from the origin or an odd number, but not both. This can be seen by considering a closed loop from the start to a given square and back. Such a loop must always be an even number of steps (since you can pair a step in any direction with a second step in the opposite direction, which is needed to return to the origin), and an odd number plus an even number is odd, which would result in an odd number of steps for a closed loop. The second part of my solution was noting that the parity flips every time you cross a tiling boundary (because the length of each side of the grid is odd). Finally, due to the empty corridors, it was straightforward to compute how many tilings of grids you would pass by walking a given number of steps in each cardinal direction. Because the edges are also clear, you can compute the area of the shape formed by all the tiles reachable from the start in a given number of steps (it forms a diamond). The caveat is that on the outer ring of tiles, there are some corners which end up unreachable, and some corners of tiles not in the diamond that should be reachable (and these end up being opposite parity). The result is an equation that maps the number of tiles walked (with coefficients based on the number of gardens in each parity group as well as the counts of gardens in the corners in each parity group) to the total number of reachable garden tiles. An alternate approach for the day (which probably involved fewer assumptions and definitely involved less geometry and counting) was to simply brute force the solution for a number of steps that results in 0, 1, and 2 tilings. You can use the answers at those points, guessing that a quadratic equation is at play (due to the nature of the 2D grid), to perform a Lagrange interpolation and find the coefficients of the polynomial. However, such an approach is by necessity slower due to needing to brute force up to 2 tilings.

My approach for both parts shared some common code to compute the end positions of the bricks after they finished falling. I first sorted the bricks by their Z component and then handled the bricks in order. I kept track of the "top view" of the tower, which held both the current height and the brick identifier at that height for each x,y position. Each new brick added to the tower needed to check its entire vertical cross-section (every (x,y) coordinate for the brick) to find the maximum height. This would become the new height for the brick after it finishes falling. Additionally, I kept two lists for each brick: the bricks on top of it, and the bricks that are supporting it underneath. When I find the height for a brick, I update the lists based on all the bricks at that height within the vertical cross-section. Part 1 was calculable directly from this result (find all bricks that only support bricks with a "supported" count of 2 or more). For part 2, I recognized that if you removed 1 brick, then any brick that was solely supported by that brick will start falling. Then any brick solely supported by those bricks would start falling, and so on. This formed a tree of bricks that needed to be searched; I used DFS. Additional optimizations were the usual suspects: parallelization with Rayon, replacing hashmaps with vectors, parsing bytes instead of UTF-8 characters, etc.

Another pathfinding puzzle. Except this time, we had to find the longest path instead of the shortest. And in general, that's NP-hard (roughly meaning that calculating a solution will be slow for even moderately large inputs).⁴ Luckily for part 1, it's only NP-hard in general. If no cycles are possible in the graph, finding the longest path is the same as finding the shortest path after flipping all the edge weights to be negative⁵, which means we can again use Dijkstra's algorithm. Sadly, part 2 introduces cycles to the graph by removing the one-directional slopes, meaning the only solution will scale exponentially with the size of the input. That means we just need to make the input smaller! The key insight is that only the cells with 3 or more neighbors matter; any cell with exactly 2 neighbors can be collapsed into one of the junctions. Other than the start and goal, any cell with exactly 1 neighbor can be ignored entirely, cascading back to the nearest junction (however, this didn't occur in the input). This enabled compressing the graph into a relatively small number of vertices, letting a brute force search return in a reasonable amount of time.

Other than compressing the graph, there were 3 large optimizations that I used to improve the search time even more. First, I kept track of the vertices that had been visited (a necessary part of the search) with a vector of bools (later replaced by a 64-bit bit set, as it seemed reasonable to assume that there would be fewer than 64 vertices post-compression - any more and an exhaustive search would become prohibitive). I would set the entry to true for a vertex before recursing and false after recursing, meaning the entire search could share the same vector (avoiding allocations). The second major optimization was "trimming" the final path to the goal up to the last junction. While this only removed one edge from the compressed graph, it significantly reduced the search space by lowering the depth by one. The final optimization was to enable parallelization by performing an initial iterative DFS search up to a certain depth (I used depth 8) and then kicking off parallel recursive searches for all the states found at that depth. Day 23 might have seen the largest absolute improvement out of all the days from the original unoptimized version (250ms) to the final optimized version (13.3ms).
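
A hedged sketch of the shared visited bit set during the recursive search (graph[v] assumed to be a list of (neighbor, edge_length) pairs on the compressed graph):

```rust
// Longest path from `v` to `goal` that avoids vertices already marked in `visited`.
// With fewer than 64 compressed vertices, a u64 bit set replaces the Vec<bool>.
fn longest_path(graph: &[Vec<(usize, u32)>], v: usize, goal: usize, visited: u64) -> Option<u32> {
    if v == goal {
        return Some(0);
    }
    let visited = visited | (1 << v); // mark before recursing
    let mut best = None;
    for &(next, len) in &graph[v] {
        if visited & (1 << next) == 0 {
            if let Some(rest) = longest_path(graph, next, goal, visited) {
                best = Some(best.unwrap_or(0).max(rest + len));
            }
        }
    }
    // Passing `visited` by value means the bit is implicitly cleared on return.
    best
}
```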

Math. Lots of math. Part 1 was a simple application of the equations of motion and line-line intersections (which I derived algebraically by hand because I forgot the formulas). Part 2 was a doozy though. All my initial attempts at an algebraic solution failed because I ended up with non-linear terms in my system of equations. While I know some just plugged it into a solver such as Z3 and called it a day, I didn't think that would count for my self-imposed ground rules. Eventually I realized that the non-linear terms in my equations only depended on the rock and didn't contain any coefficients from the hailstones. This meant I could "double up" by adding more hailstones and subtracting one equation from another to cancel out the non-linear terms.

The approach I used to solve the problem started by coming up with an equation of when the rock would hit the first hailstone: $p + vt = p_0 + v_0 t$ (where $p$ and $v$ are vectors for position and velocity). Rearranging the terms results in $(p - p_0) = -t(v - v_0)$, and since $-t$ is a scalar, the vectors on both sides must be parallel. Two parallel vectors will always have a cross product that is zero, so $(p - p_0) \times (v - v_0) = 0$. I later learned that various mathematical techniques could have been applied at this point to make my life easier, but instead, I manually expanded that cross-product into the scalar components (resulting in 3 bilinear equations equalling 0). I repeated with a second hailstone and subtracted those equations from the first equations. As noted earlier, this resulted in a linear system of 3 equations. I had six unknowns, so I had to repeat the whole process over with 2 more hailstones (I reused the first hailstone, so I only used 3 total) to get a system of 6 linear equations with 6 unknowns. Initially I used a library to solve this (ndarray-linalg with their OpenBLAS backend), but this not only caused long annoying compile times but also felt like it violated my safe rust goal. I ended up writing a Gaussian elimination implementation which let me drop the ndarray dependency.
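
The Gaussian elimination itself is small; a hedged sketch with partial pivoting (the day 24 system is 6 equations in 6 unknowns):

```rust
// Solve A x = b for a small dense system, returning None if it looks singular.
fn solve(mut a: Vec<Vec<f64>>, mut b: Vec<f64>) -> Option<Vec<f64>> {
    let n = b.len();
    for col in 0..n {
        // Partial pivoting: pick the row with the largest entry in this column.
        let pivot = (col..n).max_by(|&i, &j| a[i][col].abs().total_cmp(&a[j][col].abs()))?;
        if a[pivot][col].abs() < 1e-12 {
            return None;
        }
        a.swap(col, pivot);
        b.swap(col, pivot);
        // Eliminate this column from every row below the pivot.
        for row in col + 1..n {
            let factor = a[row][col] / a[col][col];
            for k in col..n {
                a[row][k] -= factor * a[col][k];
            }
            b[row] -= factor * b[col];
        }
    }
    // Back substitution.
    let mut x = vec![0.0; n];
    for row in (0..n).rev() {
        let sum: f64 = (row + 1..n).map(|k| a[row][k] * x[k]).sum();
        x[row] = (b[row] - sum) / a[row][row];
    }
    Some(x)
}
```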

A fun day to end on! The problem was almost, but not quite, the minimum cut of a flow network. The catch is that the input wasn't a flow network; it was a plain undirected, unweighted graph. There were no source or sink nodes and no obvious way to introduce them (like one might in a bipartite graph to find the maximum matching). As it turns out, there is an algorithm for finding the minimum cut in general undirected graphs (not networks), but I didn't know about it. What I did know was how to find the minimum cut in a network using the Ford-Fulkerson algorithm. I had a hammer, so I was going to treat the problem as a nail.

The observation I had was that if I picked an arbitrary vertex as the source, then all I needed to do was find some vertex on the other side of the cut and treat it as the sink. Since the problem statement described the cut as splitting the graph roughly in two, I had about a 50% chance of finding such a vertex if I just picked one randomly. So that's what I did! I iterated over the remaining vertices in random order and calculated the min-cut using each vertex as the sink. Since I knew the minimum cut was 3 edges (despite not knowing which edges), I could return early as soon as I saw that a cut would be 4 or larger. Since finding the min-cut for a given sink in an unweighted graph takes only a handful of BFS passes ($O(E)$ each, where $E$ is the number of edges), the worst-case runtime for the overall algorithm is $O(VE)$, but the average case is more like $O(E)$ since the expected number of vertices we need to try is ~2. As it turns out, this is quite a bit faster than the aforementioned Stoer-Wagner algorithm I didn't know about. This worked primarily because I could heavily exploit the knowledge that the minimum cut was exactly 3 edges, eliminating the need for an exhaustive search.

Footnotes

  1. Some of my dependencies (like Rayon) use unsafe, but when possible I used crates that were 100% safe (like using tinyvec instead of smallvec).

  2. Brute force still worked to solve the puzzle, but it would have blown the runtime budget by a large margin.

  3. If a ghost's cycle didn't contain a destination, a solution still might exist before the ghost enters their cycle. But then the problem would be trivial.

  4. Yes, I know this is a vast oversimplification.

  5. This only works on acyclic graphs because Dijkstra's algorithm will run forever if there are cycles with negative edge weights.