[Optimization] Advance parser concurrently with model forward pass #1065
Conversation
Co-authored-by: Loc Huynh <lohuynh@microsoft.com>
…ration but the final one
@Harsha-Nori ran some quick benchmarks to make sure the parallelization was actually working. I ran a JSON generation task (RPG characters) with two different models. Statistics are reported in milliseconds per token.

Meta-Llama-3.1-8B-Instruct-Q8_0: 266 tokens (ran at zero temperature, so this was consistent across runs)
95% CI for difference in means: (1.239, 1.256)

Phi-3-mini-4k-instruct-q4: 103 tokens (ran at zero temperature, so this was consistent across runs)
95% CI for difference in means: (-0.010, 0.141)

Conclusions
We see some slight speedups, and any threading overhead seems pretty negligible. I feel comfortable pushing this PR forward :)
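For context, here is a minimal sketch of how such a confidence interval could be computed from per-token timing samples. This is not the benchmark script used above; it assumes independent samples and a normal approximation with unpooled variances.

```python
import numpy as np

def diff_of_means_ci95(a_ms, b_ms, z=1.96):
    """Approximate 95% CI for mean(a_ms) - mean(b_ms), assuming independent samples."""
    a = np.asarray(a_ms, dtype=float)
    b = np.asarray(b_ms, dtype=float)
    diff = a.mean() - b.mean()
    # Standard error of the difference (Welch-style, unpooled variances)
    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    return diff - z * se, diff + z * se

# baseline_ms, threaded_ms = per-token latencies (ms) from each run
# print(diff_of_means_ci95(baseline_ms, threaded_ms))
```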
Tests currently failing because a consequence of this PR is that we do an additional forward pass at the end of generation because we do …
@hudson-ai let me know if this helps: guidance-ai/llguidance@e9cfc18
@mmoskal thanks a bunch! We can now often (not always, mind you) prevent an unnecessary forward pass if the next call to the parser is going to give us the …
Codecov Report
Attention: Patch coverage is …

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1065      +/-   ##
==========================================
- Coverage   66.48%   65.36%   -1.12%
==========================================
  Files          65       65
  Lines        5102     5140      +38
==========================================
- Hits         3392     3360      -32
- Misses       1710     1780      +70

☔ View full report in Codecov by Sentry.
# Upstairs should have already waited on this future
mask, _ = mid_process_future.result()

if mask is None:
    if token is not None:
        raise TokenParserException(f"Expected None, got token {token}")
If mask is None, isn't it in accepting mode? Shouldn't any tokens be accepted?
The mask should never be None unless the parser is actually done (i.e. we should not be accepting ANY tokens, as the loop should be stopping). This condition should be equivalent to ll_response.stop if we were to parse the string in the second slot of the future above.
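To make that equivalence concrete, here is a tiny sketch; the import path and variable names are assumptions, not the PR's actual code:

```python
from guidance._schema import LLInterpreterResponse  # import path is an assumption

# The second slot of the future is assumed to hold the raw mid_process response JSON.
mask, ll_response_str = mid_process_future.result()
ll_response = LLInterpreterResponse.model_validate_json(ll_response_str)

# Expected invariant: the mask is None exactly when the parser says to stop,
# i.e. no further tokens should be accepted and the generation loop should end.
assert (mask is None) == ll_response.stop
```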
Note the .cleanup code, which is currently responsible for sending the final None token to get the generator loop to break. Let me know if you have any better ideas on how to structure it!
guidance/_parser.py
Outdated
gen_data = None
token = yield (gen_data, response)
# Upstairs should have already waited on this future
mask, _ = mid_process_future.result()
Why don't we get LLInterpreterResponse.model_validate_json(ll_response_str) here instead of just the string?
Since both the parser and the engine check ll_response.stop to break, why do we need the cleanup function?
> Why don't we get LLInterpreterResponse.model_validate_json(ll_response_str) here instead of just the string?
We could add the pydantic validation into the thread to make sure the object-version of the string is always returned from the future -- I suppose I was feeling cautious that adding some CPU work into the thread might block the forward pass and slow things down because ya know, GIL. But I think this was unfounded since we have to wait on that work regardless. I can try this and see if it affects timings at all.
> Since both the parser and the engine check ll_response.stop to break, why do we need the cleanup function?
This is definitely a bit annoying... The parser loop isn't running while the "upstairs" caller is running -- i.e. it can't even check the value of ll_response.stop until the caller sends it a final None. Technically, we can just abandon the generator before it terminates, but that puts us in the confusing situation where not all the code in the parser actually runs. The cleanup exists out of an abundance of caution, just making sure that the parser generator finishes, doing any final checks/validation on state as we do so. Happy to jump on a call to discuss.
Thank you for looking this over!
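For illustration, here is a stripped-down sketch of the generator protocol described above (toy names only, not the actual _parse implementation): the caller drives the generator with .send(), and the final None is what lets the generator body fall out of its loop and run any trailing checks instead of being abandoned mid-flight.

```python
def parse_loop():
    """Toy stand-in for the parser coroutine."""
    while True:
        token = yield "gen_data"     # caller sends the sampled token back in
        if token is None:            # the final None lets us exit the loop...
            break
    # ...so any trailing validation placed here actually executes
    print("final state checks ran")

def cleanup(gen):
    """Drive the generator to completion so its trailing code runs."""
    try:
        gen.send(None)
    except StopIteration:
        pass

gen = parse_loop()
next(gen)        # prime the generator up to its first yield
gen.send(123)    # caller sends a sampled token, receives the next gen_data
cleanup(gen)     # rather than abandoning the generator, finish it cleanly
```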
> We could add the pydantic validation into the thread to make sure the object-version of the string is always returned from the future -- I suppose I was feeling cautious that adding some CPU work into the thread might block the forward pass and slow things down because ya know, GIL. But I think this was unfounded since we have to wait on that work regardless. I can try this and see if it affects timings at all.
Yeah, we can try. I guess it'll make the code a bit cleaner. I don't think it will consume a lot of CPU cycles to validate the JSON string.
It's really just on the order of 1-3 hundredths of a millisecond to parse the string. Should be fine to throw it in the thread (again, the engine call would have had to do it sequentially anyway, so this may very well cost literally nothing and give us some cleaner code).
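If the validation does move into the thread, the worker function might look roughly like this. This is a sketch only; the mid_process signature and the names here are assumptions based on the diff above.

```python
def _mid_process_and_parse(interp):
    # Runs in the worker thread. The Rust call releases the GIL, so it overlaps
    # the forward pass; the pydantic parse afterwards costs ~0.01-0.03 ms.
    mask, ll_response_str = interp.mid_process()  # signature assumed
    ll_response = LLInterpreterResponse.model_validate_json(ll_response_str)
    return mask, ll_response

# The future would then yield the validated object rather than the raw string:
# mid_process_future = executor.submit(_mid_process_and_parse, interp)
# mask, ll_response = mid_process_future.result()
```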
Refactors Engine.__call__ and the TokenParser._parse coroutine/generator to yield a future wrapping LLInterpreter.mid_process (running in a ThreadPoolExecutor) such that mid_process can run concurrently with the forward pass. This function comes directly from the Rust extension, so it releases the GIL and threading should be sufficient to ensure true concurrency.

@lochuynh1412 I stole a bit of your code from the interactivity overhaul in order to simplify Engine.__call__. I'm definitely introducing a merge conflict for that PR -- happy to help resolve it :)
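For readers skimming the thread, the overall shape of the change looks roughly like the sketch below. This is not the real Engine.__call__; forward_pass, sample_token, and advance are placeholder names, and only mid_process and the ThreadPoolExecutor usage come from the PR itself.

```python
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=1)

def generation_loop(interp, model, tokens):
    """Illustrative only; everything except mid_process is a placeholder."""
    while True:
        # Parser work goes to the worker thread. mid_process comes from the Rust
        # extension and releases the GIL, so it truly overlaps the forward pass.
        mid_process_future = executor.submit(interp.mid_process)

        logits = model.forward_pass(tokens)       # placeholder model call

        mask, _ll_response_str = mid_process_future.result()
        if mask is None:                          # parser finished; stop the loop
            break

        token = model.sample_token(logits, mask)  # placeholder constrained sampling
        tokens.append(token)
        interp.advance(token)                     # placeholder: feed token to parser
```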