test: improve pytest infrastructure and vLLM backend testing#416
planetf1 merged 5 commits into generative-computing:main
Conversation
The PR description has been updated. Please fill out the template for your PR to be reviewed.
Merge Protections: Your pull request matches the following merge protections and will not be merged until they are valid. 🟢 Enforce conventional commit: this rule succeeded. Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
Force-pushed 5a0cdf5 to d25e4d7
Force-pushed d25e4d7 to cd5a632
jakelorocco
left a comment
The changes make sense to me. Can you please provide example outputs of the examples and the regular tests? I'd like to see what running the vllm / heavy_gpu tests in isolation looks like.
- Add cleanup_vllm_backend() to test/conftest.py
- Remove duplicated cleanup code from test_vllm.py and test_vllm_tools.py
- Addresses PR generative-computing#416 review feedback on code duplication

- Add cleanup_vllm_backend() to test/conftest.py to eliminate code duplication
- Remove duplicated cleanup code from test_vllm.py and test_vllm_tools.py
- Migrate from IBM aLoRA to PEFT 0.18.1 native aLoRA implementation
- Add --device flag for explicit device control in aLoRA training
- Add GPU memory check and better error handling
- Restrict MPS patching to macOS only for cross-platform safety
- Fix output dirname handling for relative paths
- Add comprehensive aLoRA training tests

Addresses PR generative-computing#416 review feedback on code duplication

- Add cleanup_vllm_backend() to test/conftest.py with comprehensive documentation
- Remove duplicated cleanup code from test_vllm.py and test_vllm_tools.py
- Document that cleanup is best-effort within modules
- Note that cross-module GPU isolation requires process separation
- Addresses PR generative-computing#416 review feedback on code duplication

The shared function includes clear documentation explaining that while this cleanup helps within a module, only process exit reliably releases CUDA GPU memory. Cross-module isolation is handled by the existing pytest_collection_finish hook that runs heavy GPU tests in separate subprocesses.
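For context, a minimal sketch of what such a shared helper might look like. This is an illustration only; the real function in test/conftest.py may differ, and the NCCL teardown reflects a later commit in this PR.

```python
# Hypothetical sketch of the shared cleanup helper; best-effort only, since
# only process exit reliably returns CUDA memory to the operating system.
import gc

import torch


def cleanup_vllm_backend() -> None:
    """Best-effort GPU cleanup after a module's vLLM backend has been dropped.

    Callers should delete their backend references first (``del backend``);
    even then, cached CUDA memory may remain reserved until the process exits.
    """
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    # Tear down the distributed process group vLLM may have initialised,
    # which also suppresses NCCL shutdown warnings.
    if torch.distributed.is_available() and torch.distributed.is_initialized():
        torch.distributed.destroy_process_group()
```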
Force-pushed 0542af2 to 8a35e19
Here's a run. The mechanism is working reliably in a CUDA environment. However, the weather-in-Boston test is flaky: sometimes Mistral will return a response about London or New York. This is a qualitative prompt/model issue rather than a mechanism issue; we may need to review/modify further as we use these tests in an automated formal verification step.
I noted the unreliability in responses, and in trying to address some warnings I found a potential issue in how we handle tokenization. Raised issue #431 to discuss further.
@planetf1, in the example output above, is each collection of tests a separate run of pytest? Also, looking at the output, it seems that if a module fails, it doesn't tell us which test failed. Is it possible to do this GPU memory cleaning through a regular pytest autouse fixture, so that we keep the results in regular pytest form?
Yes, each collection is run separately. I was unable to get the required isolation without some juggling at the process level, due to the way CUDA allocation works (not a problem on macOS). It would be wonderful if someone could demonstrate that this is possible and that the explanation below is wrong, but I couldn't get there without splitting processes. I'll add a longer (generated) explanation below. The individual module errors are reported, but you need to look back through the output; I'll try to add a workaround for this.

**The CUDA Memory Problem**

**CUDA's Memory Model:** GPU memory is tied to the per-process CUDA context, and PyTorch's caching allocator keeps freed blocks reserved rather than returning them to the driver.
**What Happens with Fixtures:**

```python
import gc

import pytest
import torch

# VLLMBackend is the project's vLLM backend wrapper (import path omitted here)


@pytest.fixture(scope="module")
def vllm_backend():
    backend = VLLMBackend(model="mistralai/Mistral-7B-Instruct-v0.3")
    yield backend
    # Cleanup code runs here
    del backend
    gc.collect()
    torch.cuda.empty_cache()  # This doesn't actually free memory!
```

**The Problem:** the fixture only drops Python references and clears the allocator cache; the CUDA context created by the process still holds the GPU memory.
**Why Process Isolation Works**

**With Separate Processes:** each heavy test module runs in its own pytest subprocess, and when that process exits the OS destroys its CUDA context and reclaims all of its GPU memory.

**Key Difference:** cleanup happens at the process level (guaranteed by the OS) rather than at the Python object level (best effort).

**Real-World Impact**

**Without Process Isolation (using fixtures only):** memory held by one module can still be resident when the next module loads its model, and later tests can fail with out-of-memory errors.

**With Process Isolation:** each heavy module starts against a clean GPU.

**Why This Matters for vLLM/HuggingFace:** both allocate large, long-lived buffers (model weights, KV cache) that are only reliably released at process exit.

**Technical Details**

**CUDA Memory Hierarchy:** (1) the process-level CUDA context managed by the driver, (2) the framework's caching allocator, (3) Python objects. Fixtures operate at level 3, but the problem is at level 1.

**Summary**

**Why fixtures don't work:** they can delete objects and call `torch.cuda.empty_cache()`, but they cannot tear down the process-level CUDA context.

**Why process isolation works:** process exit is the only point at which the OS reliably reclaims all GPU memory.

This is not a pytest limitation - it's a fundamental constraint of how CUDA manages GPU memory at the operating system level.
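To make the mechanism concrete, here is a minimal, hypothetical sketch of the kind of collection hook involved. It assumes a `heavy_gpu` marker and simplifies reporting; the actual hook in test/conftest.py differs in its details.

```python
# Illustrative sketch only; assumes a "heavy_gpu" marker and a CUDA check.
import os
import subprocess
import sys

import pytest
import torch


def pytest_collection_finish(session):
    """Run modules that contain heavy GPU tests in separate pytest processes.

    Only relevant on CUDA systems: process exit is the only reliable way to
    return the GPU memory held by a module's models to the operating system.
    """
    if os.environ.get("GPU_ISOLATED_RUN") == "1":
        return  # already inside an isolated subprocess
    if not torch.cuda.is_available():
        return

    heavy_items = [i for i in session.items if i.get_closest_marker("heavy_gpu")]
    if not heavy_items:
        return

    # Skip the in-process copies so each module only runs in its subprocess.
    for item in heavy_items:
        item.add_marker(pytest.mark.skip(reason="run in isolated subprocess"))

    # Batch by module and run each module in its own pytest process.
    for module in sorted({str(i.fspath) for i in heavy_items}):
        result = subprocess.run(
            [sys.executable, "-m", "pytest", module, "-v"],
            capture_output=True,
            text=True,
            env={**os.environ, "GPU_ISOLATED_RUN": "1"},
        )
        if result.returncode != 0:
            # Surface the isolated run's output so failures are visible.
            print(result.stdout)
```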
PyTorch documentation states that even after cleanup, "the unused memory managed by the allocator …"

More specifically for vLLM, the vLLM maintainers themselves recommend process termination as the "…"

"In general it is very difficult to clean up all resources correctly, especially when we use …"
Added code in the cuda/large test code in …
Here's a log with a forced error (intentionally broken). Basically it adds an extra summary. @jakelorocco is that a help? It only affects the cuda/high-mem tests (or when a batch includes them).
- Add pytest skip mechanism with capability detection and CLI options
- Implement process isolation for GPU-intensive vLLM tests
- Enhance test configuration with safe option registration
- Fix vLLM structured output token limits and update documentation

- Extract duplicated cleanup code to cleanup_vllm_backend() in conftest.py
- Add NCCL process group cleanup to suppress warnings
- Add Mistral tokenizer mode to suppress FutureWarning
- Simplify comments for conciseness
The tokenizer_mode='mistral' parameter caused a mismatch between vLLM's internal tokenizer (with Mistral mode) and the backend's separate tokenizer (without Mistral mode). This resulted in malformed tool call formatting. Reverts to default tokenizer mode to restore tool calling functionality.
Parse pytest output to show which tests failed across isolated modules. Addresses review feedback on test failure visibility.
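As an illustration of the kind of parsing this refers to (a hypothetical helper, not the PR's actual code), something like the following would pull the failing test ids out of a captured subprocess run:

```python
# Hypothetical helper; the PR's actual parsing may differ.
import re


def extract_failures(pytest_output: str) -> list[str]:
    """Return test ids from FAILED lines in pytest's short test summary."""
    return re.findall(r"^FAILED (\S+)", pytest_output, flags=re.MULTILINE)
```

Each returned id looks like `test/test_vllm.py::test_generate`, which is enough to report per-test failures from the parent process.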
Force-pushed 2b9ad75 to a4c01ed
Actually, it seems pytest-xdist may reuse the subprocess across tests. If it is overkill to run all tests in subprocesses, they provide a marker for running a test in its own subprocess, so we should mark the vllm tests with it.
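Presumably this refers to the `forked` marker provided by the pytest-forked plugin (an assumption; the snippet below is illustrative only, including the hypothetical test body and fixture):

```python
# Sketch assuming the pytest-forked plugin is installed; the test body and
# fixture names are hypothetical.
import pytest


@pytest.mark.forked  # run this test in its own forked subprocess
def test_vllm_generate_isolated(vllm_backend):
    result = vllm_backend.generate("Hello")
    assert result
```

Note that pytest-forked relies on os.fork, so this approach is POSIX-only.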
I think the current output looks clean, especially with the summary. If we can get pytest-xdist working, I'd be fine with that as well; however, I believe there were issues the last time that was attempted (even when the number of parallel processes was set to 1).
I hope it is because the processes are reused and …
Agreed, this commit can be merged, and later, if time allows, a PR refactoring this into …
I'd say we're good to merge; then we can iterate if we think there's a better option.
jakelorocco
left a comment
lgtm; future improvements can be discussed / created in new issues
66a90c7
Improve pytest infrastructure and vLLM backend testing
Type of PR
Description
Fixes #415
Enhances pytest infrastructure with capability detection and process isolation for GPU tests. Fixes process isolation to only activate on CUDA systems, preventing unnecessary overhead on macOS and systems without a GPU.
Key changes:
- New pytest CLI options to override capability checks (`--ignore-gpu-check`, `--ignore-ram-check`, etc.); a sketch of how such options might be registered is shown below

Testing
Tested
Note that if any large tests are run on CUDA, pytest will enforce additional isolation, effectively batching the tests into groups.
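For illustration, here is a minimal sketch of how capability-override options like these could be registered in a conftest.py. The flag names come from this description; the option group, defaults, and the `heavy_gpu` marker check are assumptions rather than the PR's actual implementation.

```python
# Hypothetical conftest.py fragment; names other than the CLI flags are assumed.
import pytest
import torch


def pytest_addoption(parser):
    group = parser.getgroup("capability checks")
    group.addoption(
        "--ignore-gpu-check",
        action="store_true",
        default=False,
        help="Run GPU-marked tests even if no suitable GPU is detected.",
    )
    group.addoption(
        "--ignore-ram-check",
        action="store_true",
        default=False,
        help="Run memory-heavy tests even if available RAM looks insufficient.",
    )


def pytest_collection_modifyitems(config, items):
    # Skip heavy GPU tests when no CUDA device is present, unless overridden.
    if config.getoption("--ignore-gpu-check") or torch.cuda.is_available():
        return
    skip_gpu = pytest.mark.skip(
        reason="no suitable GPU detected (use --ignore-gpu-check to override)"
    )
    for item in items:
        if item.get_closest_marker("heavy_gpu"):
            item.add_marker(skip_gpu)
```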