@schneems (Collaborator) commented Jan 15, 2022

TL;DR: I've been trying to optimize a "worst-case" perf scenario (a 9,000-line file) to get it under my 1-second threshold. When I started, it was at ~2 seconds. After this last optimization, it's well under the threshold!

```
Before [main]:        1.22 seconds
After [this commit]:  0.67386 seconds
```

> Cumulatively, this is 2.6x faster than 3.1.1 and 47x faster than 1.2.0.

## Profiling before/after

Expansion before (called 2,291 times, 42.04% of overall time)
Parse before (called 3,315 times, 31.16% of overall time)
![](https://www.dropbox.com/s/brw7ix5b0mhwy1z/Screen%20Shot%202022-01-14%20at%208.54.31%20PM.png?raw=1)
![](https://www.dropbox.com/s/8mx20auvod5wb8t/Screen%20Shot%202022-01-15%20at%201.10.41%20PM.png?raw=1)

Expansion after (called 654 times, 29.42% of overall time)
Parse after (called 1,507 times, 29.35% of overall time)
![](https://www.dropbox.com/s/3rmtpfk315ge7e6/Screen%20Shot%202022-01-14%20at%208.55.45%20PM.png?raw=1)

> Note that parsing happens within the expansion method call, so while it seems "cheaper" to parse than expand based on this profiling data, it's the parsing that is making expansion slow.

## Goal

Make the algorithm faster so that it doesn't time out, even when given a file with 9,000 lines.

## Strategy (background)

There are two general strategies for improving dead_end perf:

- Reduce/remove calls to Ripper (syntax error detection), as it is slow. For example, don't call Ripper if all code lines have been previously "hidden" (indicating valid code).
- Improve the computational complexity of non-Ripper operations. For example, use a priority heap instead of sorting an array.

We call Ripper in two cases: on individual code blocks, to see whether a block contains a syntax error, and later on the whole document, to see whether that particular syntax error is what makes the document unparsable. Ripper is slower the more lines it parses, so, as a prior optimization, we only re-parse the entire document when we detect a new invalid block.

If we can build "better" valid blocks, we call Ripper fewer times on the overall document. If we can build larger blocks, we reduce the overall number of iterations of the algorithm. This property reduces Ripper calls and improves overall computational complexity, as there are N fewer operations to perform.
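
For reference, a Ripper-based validity check can be as small as the following. This is a standalone sketch of the idea, not the gem's internal API:

```
require "ripper"

# Returns true if the source parses cleanly: Ripper.new(...).parse runs
# the parser, and #error? reports whether it hit a syntax error.
def valid_ruby?(source)
  parser = Ripper.new(source)
  parser.parse
  !parser.error?
end

valid_ruby?("def foo; end")     # => true
valid_ruby?('{ hello: "world"') # => false
```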

## Heuristic strategy

This PR builds on the concept that dead_end is a uniform-cost search by adding a "heuristic" (think A* search) when building blocks. At a high level, a heuristic is a quality measure that can also be incomplete. For a great explanation, see https://www.redblobgames.com/pathfinding/a-star/introduction.html.

What heuristic can we add? We know that if the code has an unbalanced pair of special characters, it cannot be valid. For example, the code `{ hello: "world"` is unbalanced: it is missing a `}`, and it contains a syntax error due to this imbalance.

In the dead_end code, we can count known special pairs using `Ripper.lex` output (which we already have). Here are the currently tracked pairs:

- `{}`
- `()`
- `[]`
- keyword/end

This information can be used as a heuristic. Code with unbalanced pairs always contains a syntax error, but balanced code may have a different (harder to detect) syntax error. The code's balance hints at how we should expand an existing code block.

For example, the code `{ hello: "world"` must expand towards the right/down to be valid. This code would be known as "left-leaning" as it is heavier on the left side.

Rather than searching in an arbitrary direction, the heuristic determines the most sensible direction to expand.
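
As a rough illustration (my own sketch built on `Ripper.lex`, not the PR's code), the lean for a single pair can be computed like this:

```
require "ripper"

# Sum the lexer events for one tracked pair: openers count +1, closers
# count -1. Positive means left-leaning, negative means right-leaning,
# zero is balanced. The real heuristic tracks all four pairs at once.
def curly_diff(source)
  Ripper.lex(source).sum do |(_pos, event, _token, _state)|
    case event
    when :on_lbrace then 1
    when :on_rbrace then -1
    else 0
    end
  end
end

curly_diff('{ hello: "world"') # => 1, left-leaning: expand right/down
```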

Previously, we expanded blocks by scanning up and down, counting keyword/end pairs (outside of the block), and looking at indentation. This process was adequate, but it required that we take small steps and produce small blocks. It also has no concept of whether the code it holds is syntactically valid until a full Ripper parse is performed. That combination means it can produce invalid code blocks (which can be desirable when hunting for invalid code). But when we want blocks to be valid, we can make more efficient moves.

## Implementation: LexDiffPair helper class

I introduced a new class, `LexDiffPair`, to accommodate heuristic expansion. It tracks the pairs of characters we care about: `{}`, `()`, `[]`, and keyword/end. The heuristic algorithm uses this data to inform block generation. Since the pair data was expected to be a hotspot (later confirmed by benchmarking), its implementation needs to be optimized. This section describes the implementation of the `LexDiffPair` class.

A `LexDiffPair` can be in one of several states:

- Balanced (`:equal`)
- Leaning left (`:left`)
- Leaning right (`:right`)
- Leaning both (`:both`)

> An example of a code line leaning both ways would be `) do |x|`. Here the `)` is leaning right (needs to expand up) while the `do` is leaning left (needs to expand down).

Each code line generates its own `LexDiffPair`.

Internally, the information is stored as a diff of the counts for each of the above pairs. A positive number indicates left-leaning, a negative number indicates right-leaning, and zero is balanced. This format allows for fast concatenation and comparison, and it encodes a lot of information in a comparatively small space.

The data was originally stored in a hash, then as an array. Ultimately using instance variables in a class was found to be the fastest option.
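
A minimal sketch of that shape (names and method signatures are assumed from this description, not taken from the actual diff):

```
# Tracks the running imbalance for each pair as plain integers held in
# instance variables (benchmarked as faster than a hash or an array).
class LexDiffPair
  attr_reader :curly, :square, :parens, :kw_end

  def initialize(curly: 0, square: 0, parens: 0, kw_end: 0)
    @curly = curly   # {} pairs; positive means an unclosed "{"
    @square = square # [] pairs
    @parens = parens # () pairs
    @kw_end = kw_end # keyword/end pairs
  end

  # Concatenation is just adding counts, so combining per-line diffs
  # into a block-level diff is cheap.
  def concat(other)
    @curly += other.curly
    @square += other.square
    @parens += other.parens
    @kw_end += other.kw_end
    self
  end

  def balanced?
    @curly == 0 && @square == 0 && @parens == 0 && @kw_end == 0
  end

  def leaning
    return :equal if balanced?
    counts = [@curly, @square, @parens, @kw_end]
    left = counts.any?(&:positive?)
    right = counts.any?(&:negative?)
    if left && right then :both
    elsif left then :left
    else :right
    end
  end
end
```

Under this sketch, `) do |x|` would carry `parens: -1, kw_end: 1` and therefore report `:both`.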

## BalanceHeuristicExpand

When a block is passed to `BalanceHeuristicExpand`, a `LexDiffPair` is created from all of its lines. The code's balance is then used to determine which direction to expand.

### Heuristic states

The main algorithm functions as a state machine based on which way the given code block is "leaning" (a rough sketch follows this list):

- `:left` or `:right`: If the code is leaning left or right, the main goal is to get it back into balance. Accumulate lines in the correct direction until the overall block changes its leaning.

- `:equal`: If the code is already balanced, try to expand it while maintaining balance. For example, add already-balanced code and any above/below pairs of lines that cancel out. Finally, intentionally knock it off balance by expanding up, then recurse (to ideally re-balance it and exit).

- `:both`: If the code is leaning both ways, grab any extra "free" (already balanced) lines, then expand up and down, since the block must expand both ways to become valid. As this process repeats, it should eventually find a match in one direction and then lean left or right, which expands faster on the next iteration (after the block is put back on the frontier and popped again later).
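
Here is a rough sketch of that dispatch. The helpers (`expand_up`, `expand_down`, `grab_adjacent_balanced_lines`) are assumptions for illustration; imagine they pull lines into the block and return false at the document's edge:

```
def heuristic_expand(block)
  case block.lex_diff.leaning
  when :left  # heavy on openers: accumulate lines below until the lean flips
    loop { break unless block.lex_diff.leaning == :left && block.expand_down }
  when :right # heavy on closers: accumulate lines above until the lean flips
    loop { break unless block.lex_diff.leaning == :right && block.expand_up }
  when :both  # must grow in both directions to have a chance of balancing
    block.grab_adjacent_balanced_lines
    block.expand_up
    block.expand_down
  when :equal # take "free" balanced lines, then intentionally unbalance
    block.grab_adjacent_balanced_lines
    heuristic_expand(block) if block.expand_up # recurse to re-balance and exit
  end
  block
end
```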

Example:

An invalid code block might come in like this:

```
       start_line: keyword.location.start_line,
       start_char: keyword.location.end_char + 1,
       end_line: last_node.location.end_line,
       end_char: last_node.location.end_char
```

And it could expand it to:

```
   if exceptions || variable
    RescueEx.new(
     exceptions: exceptions,
     variable: variable,
     location:
      Location.new(
       start_line: keyword.location.start_line,
       start_char: keyword.location.end_char + 1,
       end_line: last_node.location.end_line,
       end_char: last_node.location.end_char
      )
     )
    end
```

> Note it would expand quite a bit more; I'm abbreviating here. Here's the complete process across several expansions: https://gist.github.com/schneems/e62216b610a36a81a98d9d8146b0611a

## When to use heuristic expand

Once the expansion algorithm was performing as desired, it still needed to be wired into the overall search algorithm.

After experimentation, the best place to use this new expansion technique turned out to be when an existing invalid block comes off the frontier. The heuristic algorithm tends to correct such blocks and balance them out. When a block is "corrected" in this way, it reduces the overall number of times the document must be passed to Ripper (extremely slow). Also, since larger blocks reduce overall iteration, we try to expand several times (while preserving balance) and take the largest valid expansion.
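
The "expand several times, keep the largest valid result" step might look roughly like this (hypothetical names, reusing the `valid_ruby?` check from earlier):

```
# Run the balanced expansion a few times and remember the largest
# candidate that still parses. Three tries mirrors the behavior
# described under "Concerns" below.
def expand_largest_valid(block, tries: 3)
  best = candidate = block
  tries.times do
    expanded = BalanceHeuristicExpand.call(candidate)
    break if expanded.lines == candidate.lines # no progress; stop early
    candidate = expanded
    best = candidate if valid_ruby?(candidate.to_source)
  end
  best
end
```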

We use the original indentation-based expansion for code blocks that are already valid. The downside of using a heuristic that preserves balance is that ultimately we want the algorithm to generate invalid blocks. The original expansion can produce invalid expansions, which is useful.

There is a separate process for adding unvisited lines to the frontier (rather than expanding existing blocks). Unvisited lines are not a good candidate for heuristic expansion, as it works a little too well: if we only add "unbalanced" code in an added block, we lose some of the context we desire (see the comments in `parse_blocks_from_indent_line` for details).

## Concerns

One concern with this implementation is that calling the heuristic expansion three times was the only way to produce valid results. I'm not sure why. It might be an undiscovered property of the system, or perhaps all of my code examples to date are somehow biased in a specific way. The way to get more information is to put it out into the world and get feedback.

Another concern is that there are now two expansion classes, or three if you count `parse_blocks_from_indent_line`. It's not always clear why you would choose one over another, except that "it provides the best outcome." It might be possible to simplify some of this logic, or to remove or combine competing expansion methods. Hopefully, patterns will emerge that point to opportunities to guide that usage.

schneems force-pushed the schneems/balanced-lex-expansion branch from 1119247 to a21f401 on January 17, 2022
@schneems (Collaborator, Author) commented Feb 7, 2022

This more or less works by accident. I don't feel comfortable merging it. However, this work did serve as useful experience for another perf optimization that I think can be merged, over at the https://github.com/zombocom/dead_end/tree/schneems/block-node-attempt-2 branch (which I will extract and clean up in another branch).

schneems closed this on Feb 7, 2022
schneems added a commit that referenced this pull request Feb 7, 2022
I used this as a starting point for work on building a tree, as it has the LeftRightPairDiff internals I wanted. Basically everything in the PR description above is true in isolation but false in its entirety. I ended up not using the BalanceHeuristicExpand as a concept (#129).

schneems added a commit that referenced this pull request Feb 7, 2022
I wanted to start with this test case as it has a property that causes it to fail even when most/all other
test cases pass. It's the lone holdover that resisted my initial attempts to add a "heuristic expansion"
in #129.