# Add BalanceHeuristicExpand, 1.7x faster search. #129
Closed
schneems (Collaborator, Author) commented:
This more or less works by accident. I don't feel comfortable merging it. However, this work did serve as useful experience for another perf optimization that I think can be merged, over at the branch https://github.com/zombocom/dead_end/tree/schneems/block-node-attempt-2 (which I will extract and clean up in another branch).
schneems added a commit that referenced this pull request on Feb 7, 2022:
I used this as a starting point for work on building a tree as it has the LeftRightPairDiff internals I wanted. Basically everything below this is true in isolation but false in its entirety. I ended up not using the BalanceHeuristicExpand as a concept #129.
schneems added a commit that referenced this pull request on Feb 7, 2022:
I wanted to start with this test case as it has a property that causes it to fail even when most/all other test cases pass. It's the lone holdover that resisted my initial attempts to add a "heuristic expansion" in #129.
schneems added a commit that referenced this pull request on Jun 4, 2022:
I used this as a starting point for work on building a tree as it has the LeftRightPairDiff internals I wanted. Basically everything below this is true in isolation but false in its entirety. I ended up not using the BalanceHeuristicExpand as a concept #129.
schneems added a commit that referenced this pull request on Jun 4, 2022:
I wanted to start with this test case as it has a property that causes it to fail even when most/all other test cases pass. It's the lone holdover that resisted my initial attempts to add a "heuristic expansion" in #129.
TL;DR: I've been trying to optimize a "worst-case" perf scenario (a 9,000-line file) to get under my 1-second threshold. When I started, it was at ~2 seconds. After this last optimization, it's well under the threshold!
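```
Before [main]: 1.22 seconds
After [this commit]: 0.67386 seconds
```

> Cumulatively, this is 2.6x faster than 3.1.1 and 47x faster than 1.2.0.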
## Profiling before/after
Expansion before (called 2,291 times, 42.04% of overall time)

Parse before (called 3,315 times for 31.16% of overall time)

Expansion after (called 654 times, 29.42% of overall time)

Parse after (called 1,507 times for 29.35% of overall time)

> Note that parsing happens within the expansion method call, so while it seems "cheaper" to parse than expand based on this profiling data, it's the parsing that is making expansion slow.
## Goal
Make the algorithm faster, so it doesn't time out even when given a file with 9,000 lines.
## Strategy (background)
There are two general strategies for improving dead_end perf:

- Reduce/remove calls to Ripper (syntax error detection), as it is slow. For example, not calling Ripper if all code lines have been previously "hidden" (indicating valid code).
- Improve computational complexity for non-Ripper operations. For example, using a priority heap instead of sorting an array.
We call Ripper in two cases: on individual code blocks, to see if they contain a syntax error, and later on the whole document, to see if that particular syntax error was making the document unparsable. Ripper is slower on longer inputs, so as a prior optimization we only re-parse the entire document when we detect a new invalid block.
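For illustration only (a minimal sketch, not the gem's actual implementation), a "does this chunk of code parse?" check can lean on `Ripper.sexp`, which returns `nil` when the source contains a syntax error:

```ruby
require "ripper"

# Ripper.sexp returns nil when the source cannot be parsed as valid Ruby.
def valid_ruby?(source)
  !Ripper.sexp(source).nil?
end

valid_ruby?('hash = { hello: "world" }') # => true
valid_ruby?('hash = { hello: "world"')   # => false
```

dead_end's own checks also go through Ripper, though the exact entry point in the gem may differ.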
If we can build "better" valid blocks, then we call Ripper fewer times on the overall document. If we can build larger blocks, we reduce the overall number of iterations for the algorithm. This property reduces Ripper calls and improves the general computational complexity, as there are N fewer operations to perform overall.
## Heuristic Strategy
This PR builds on the concept that dead_end is a uniform-cost search by adding a "heuristic" (think A* search) when building blocks. At a high level, a heuristic is a quality measure that can also be incomplete. For a great explanation, see https://www.redblobgames.com/pathfinding/a-star/introduction.html.
What heuristic can we add? We know that if the code has an unbalanced pair of special characters, it cannot be valid. For example, the code `{ hello: "world"` is unbalanced. It is missing a `}`. It contains a syntax error due to this imbalance.

In the dead_end code, we can count known special pairs using `Ripper.lex` output (which we already have). Here are the currently tracked pairs:

- `{}`
- `()`
- `[]`
- keyword/end

This information can be used as a heuristic. Code with unbalanced pairs always contains a syntax error, but balanced code may have a different (harder to detect) syntax error. The code's balance hints at how we should expand an existing code block.

For example, the code `{ hello: "world"` must expand towards the right/down to be valid. This code would be known as "left-leaning" as it is heavier on the left side. Rather than searching in an arbitrary direction, the heuristic determines the most sensible direction to expand.
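As a rough sketch of deriving a per-pair balance from `Ripper.lex` events (hypothetical code, not dead_end's; keyword/end tracking is omitted for brevity):

```ruby
require "ripper"

# Map each open/close scanner event to its pair and a +1/-1 delta.
# Positive totals lean left, negative lean right, zero is balanced.
DELTA = {
  on_lbrace: [:curly, +1],    on_rbrace: [:curly, -1],
  on_lparen: [:paren, +1],    on_rparen: [:paren, -1],
  on_lbracket: [:square, +1], on_rbracket: [:square, -1]
}.freeze

def pair_diff(source)
  diff = { curly: 0, paren: 0, square: 0 }
  Ripper.lex(source).each do |_pos, event, _token, _state|
    pair, delta = DELTA[event]
    diff[pair] += delta if pair
  end
  diff
end

pair_diff('{ hello: "world"')        # => {curly: 1, paren: 0, square: 0}
pair_diff('hash = { hello: "world" }') # => {curly: 0, paren: 0, square: 0}
```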
Previously, we expanded blocks by scanning up and down, counting keyword/end pairs (outside of the block), and looking at indentation. This process was adequate but required that we take small steps and produce small blocks. It also has no concept of whether the code it holds is syntactically valid until a full Ripper parse is performed. That combination means it can produce invalid code blocks (which can be desirable when hunting for invalid code). But when we want blocks to be valid, we can make more efficient moves.
## Implementation: LexDiffPair helper class
I introduced a new class, `LexDiffPair`, to accommodate a heuristic expansion. It tracks the pairs of characters that we care about: `{}`, `()`, `[]`, and keyword/end. The heuristic algorithm uses this data to inform block generation. Since the pair data is expected (and later benchmarked to confirm) to be a hotspot, its implementation needs to be optimized. This section describes the implementation of the `LexDiffPair` class.

A `LexDiffPair` can be in one of several states:

- Balanced (`:equal`)
- Leaning left (`:left`)
- Leaning right (`:right`)
- Leaning both (`:both`)

> An example of a code line leaning both ways would be `) do |x|`. Here the `)` is leaning right (needs to expand up) while the `do` is leaning left (needs to expand down).

Each code line generates its own `LexDiffPair`.
Internally, the information is stored as a diff of the counts for each of the above pairs. A positive number indicates left-leaning, a negative number indicates right-leaning, and zero is balanced. This format allows for fast concatenation and comparison, and it lets us encode a lot of information in a comparatively small space.
The data was originally stored in a hash, then as an array. Ultimately using instance variables in a class was found to be the fastest option.
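A minimal sketch of that idea, with guessed names rather than the gem's actual API:

```ruby
# Hypothetical sketch of the LexDiffPair idea: one integer per tracked
# pair, where positive leans left, negative leans right, zero is balanced.
class LexDiffPair
  attr_reader :curly, :paren, :square, :kw_end

  def initialize(curly: 0, paren: 0, square: 0, kw_end: 0)
    @curly = curly
    @paren = paren
    @square = square
    @kw_end = kw_end
  end

  # Combining two diffs is plain integer addition, which is why
  # concatenating the diffs of many lines stays cheap.
  def concat(other)
    @curly += other.curly
    @paren += other.paren
    @square += other.square
    @kw_end += other.kw_end
    self
  end

  def leaning
    counts = [@curly, @paren, @square, @kw_end]
    left = counts.any?(&:positive?)
    right = counts.any?(&:negative?)
    if left && right then :both
    elsif left       then :left
    elsif right      then :right
    else                  :equal
    end
  end
end

a = LexDiffPair.new(curly: 1)                # `{ hello: "world"` leans left
a.leaning                                    # => :left
a.concat(LexDiffPair.new(curly: -1)).leaning # => :equal
```

Plain integer fields keep `concat` allocation-free, which matters when it runs once per line of a large file.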
## BalanceHeuristicExpand
When a block is passed to `BalanceHeuristicExpand`, a `LexPairDiff` is created from all of its lines. The code's balance is then used to determine which direction to expand.

### Heuristic states
The main algorithm functions as a state machine based on which way the given code block is "leaning" (a toy sketch of the dispatch follows this list):
- `left` or `right`: If the code is leaning left or right, the main goal is to get it back into balance. Accumulate lines in the correct direction until the overall block changes its leaning.
- `equal`: If the code is already balanced, try to expand it while maintaining balance. For example, add already-balanced code and any above/below pair of lines that cancel out. Finally, intentionally knock it off balance by expanding up, then recurse on the process (to ideally re-balance it and then exit).
- `both`: If the code is leaning both ways, try to grab any extra "free" (already balanced) lines. Then expand up and down, since it must expand both ways to become valid. As this process is repeated, it should eventually find a match in one direction and then be leaning left or right, which will expand faster on the next iteration (after it is put back on the frontier and popped again later).
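Here is the toy sketch promised above. It is self-contained and tracks only `{}` balance over an array of lines; every name is hypothetical, and the real class also handles the `:both` state, keyword/end pairs, and frontier bookkeeping:

```ruby
# Sum of open minus close braces; positive leans left, negative right.
def balance(lines)
  lines.sum { |l| l.count("{") - l.count("}") }
end

# One expansion step: grow the [top, bottom] range in the direction the
# current balance suggests.
def expand_step(lines, top, bottom)
  case balance(lines[top..bottom]) <=> 0
  when 1  then bottom += 1 if bottom < lines.length - 1 # leaning left: grow down
  when -1 then top -= 1 if top.positive?                # leaning right: grow up
  else                                                  # balanced: probe both ways
    top -= 1 if top.positive?
    bottom += 1 if bottom < lines.length - 1
  end
  [top, bottom]
end

lines = ['hash = {', '  hello: "world"', '}']
top = bottom = 0 # start from the unbalanced first line
top, bottom = expand_step(lines, top, bottom) until balance(lines[top..bottom]).zero?
lines[top..bottom] # => all three lines, now balanced
```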
Example:
An invalid code block might come in like this:
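```
start_line: keyword.location.start_line,
start_char: keyword.location.end_char + 1,
end_line: last_node.location.end_line,
end_char: last_node.location.end_char
```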
And it could expand it to:
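```
if exceptions || variable
  RescueEx.new(
    exceptions: exceptions,
    variable: variable,
    location: Location.new(
      start_line: keyword.location.start_line,
      start_char: keyword.location.end_char + 1,
      end_line: last_node.location.end_line,
      end_char: last_node.location.end_char
    )
  )
end
```

> Note it would expand it quite a bit more, but I'm abbreviating here. Here's the complete process across several expansions: https://gist.github.com/schneems/e62216b610a36a81a98d9d8146b0611a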
## When to use heuristic expand
Once the expansion algorithm was performing as desired, it still needed to be put to use in the overall search algorithm.
After experimentation, the best place to use this new expansion technique was when an existing invalid block comes off of the frontier. The heuristic algorithm tends to correct such blocks and balance them out. When a block is "corrected" in this way, it reduces the overall number of times the document must be passed to Ripper (extremely slow). Also, since larger blocks reduce the overall iteration count, we try to expand several times (while preserving balance) and take the largest valid expansion.
We use the original indentation-based expansion for code blocks that are already valid. The downside of using a heuristic that preserves balance is that ultimately we want the algorithm to generate invalid blocks. The original expansion can produce invalid expansions, which is useful.
There is a separate process for adding unvisited lines to the frontier (rather than expanding existing blocks). Unvisited lines are not a good candidate for heuristic expansion, as it works a little too well. If we only add "unbalanced" code in an added block, we lose some of the context we desire (see the comments in `parse_blocks_from_indent_line` for more details).

## Concerns
One concern with this implementation is that calling the heuristic expansion three times was the only way to produce valid results. I'm not sure why. It might be an undiscovered property of the system, or perhaps all of my code examples to date are somehow biased in a specific way. The way to get more information is to put it out into the world and get feedback.
Another concern is that there are now two expansion classes, or three if you count `parse_blocks_from_indent_line`. It's not always clear why you would choose one over another, except that "it provides the best outcome". It might be possible to simplify some of this logic or remove or combine competing expansion methods. Hopefully, patterns will emerge pointing to opportunities to guide that usage.