Enable SDG batching with vLLM #1797
Conversation
Force-pushed from e9d0d02 to bda25e9
Force-pushed from bda25e9 to f052f4b
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Force-pushed from f052f4b to 973a178
Force-pushed from 973a178 to 8929260
Force-pushed from 8929260 to 7b620b4
src/instructlab/data/generate.py (Outdated)
if isinstance(backend_instance, llama_cpp_server):
    batch_size = 0
    logger.warning(
        "Disabling SDG batching - unsupported with llama.cpp serving"
    )
Should we only log this message when the user requested a particular batch-size?
Yep, good point
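A minimal sketch of the suggested tweak, assuming `None` is used as the sentinel for "no batch size requested" (the actual CLI default handling may differ):

```python
if isinstance(backend_instance, llama_cpp_server):
    if batch_size is not None:
        # Only warn when the user explicitly requested a batch size;
        # otherwise fall back to unbatched generation silently.
        logger.warning(
            "Disabling SDG batching - unsupported with llama.cpp serving"
        )
    batch_size = 0
```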
Force-pushed from 7954527 to fcb84b3
Relates to instructlab/sdg#135

Since instructlab-sdg-0.1.3, data generation in batches is supported and controlled by a parameter to the `generate_data()` function. This is not supported with llama-cpp, so we disable it in that case.

Co-authored-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Now that we require a new enough version of instructlab-sdg, we can expose this.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Force-pushed from fcb84b3 to fafa8c0
@gabe-l-hart was that intentional? Sorry I didn't coordinate - just looking to get this merged quickly.
Also add a troubleshooting note referencing instructlab#1892, which tracks a todo item to add a way to automatically disable batching in this case.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Force-pushed from fafa8c0 to 3c03122
@markmc Sorry if that wasn't wanted! I got the mergebot conflict notification and wanted to clear the notification before diving into kid time and becoming unavailable for the rest of the weekend. Feel free to force push back if needed. (yikes, typing on a phone is hard!)
Epic: instructlab/sdg#135
Requires: instructlab/sdg#157
When using a serving backend that supports batches of requests (i.e. vLLM), the current SDG pipeline sends the entire dataset as a single large batch of requests to the OpenAI server. This may lead to some requests waiting too long for the response, resulting in timeout errors and potentially overloading the backend server with extremely large batches.
SDG now supports launching parallel tasks to send smaller batches in parallel to the OpenAI server. This is controlled by the `batch_size` argument to `generate_data()`:

- `batch_size=None` - use the library's built-in default batch size (currently 0, but this will change to 8 in a future release)
- `batch_size=0` - disable batching, just use one batch
- `batch_size=<int>` - use batches of the specified size

If the user sets the batch size to 8, the system will run a concurrent thread on each CPU of the system, each sending a batch of 8 prompts. So, on an 8-CPU system, this results in a total of 64 requests processed simultaneously by the backend server (see the sketch below for an illustration).
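To make the concurrency math above concrete, here is a minimal sketch of the dispatch pattern described. It is not the actual SDG implementation; `send_batch()` and `generate_in_batches()` are hypothetical names standing in for the library's internals.

```python
# Illustrative sketch only: SDG's real batching lives in instructlab-sdg;
# send_batch() stands in for one batch of requests to the OpenAI-compatible server.
import os
from concurrent.futures import ThreadPoolExecutor


def send_batch(batch):
    # Placeholder: issue one request batch and collect the responses.
    return [f"response for {prompt!r}" for prompt in batch]


def generate_in_batches(prompts, batch_size=8):
    if batch_size == 0:
        # Batching disabled: send everything as one large batch.
        return send_batch(prompts)
    batches = [
        prompts[i : i + batch_size] for i in range(0, len(prompts), batch_size)
    ]
    # One worker thread per CPU; on an 8-CPU host with batch_size=8 this means
    # up to 8 x 8 = 64 prompts in flight at once.
    with ThreadPoolExecutor(max_workers=os.cpu_count()) as pool:
        results = []
        for batch_result in pool.map(send_batch, batches):
            results.extend(batch_result)
        return results
```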
This PR adds `--batch-size` to `ilab data generate` to allow overriding the default behaviour of a batch size of 8 with vLLM and zero with llama-cpp.

Supports instructlab/sdg#135
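For context, a hedged sketch of how such a flag could be wired up with click. Only the `--batch-size` option name comes from this PR; the `--backend` option, `resolve_batch_size()` helper, and defaults below are assumptions based on the description above, not the merged code.

```python
# Hedged sketch: ilab's CLI is built on click, but everything here apart from
# the --batch-size option name is illustrative rather than the merged code.
import click


def resolve_batch_size(batch_size, backend):
    # Assumed policy from the PR description: default to 8 for vLLM and to 0
    # (batching disabled) for llama-cpp, unless the user overrides it.
    if batch_size is not None:
        return batch_size
    return 0 if backend == "llama-cpp" else 8


@click.command()
@click.option("--batch-size", type=int, default=None,
              help="Prompts per request batch sent to the serving backend.")
@click.option("--backend", type=click.Choice(["vllm", "llama-cpp"]),
              default="vllm", help="Illustrative stand-in for backend detection.")
def generate(batch_size, backend):
    click.echo(f"Effective batch size: {resolve_batch_size(batch_size, backend)}")


if __name__ == "__main__":
    generate()
```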