
Conversation

@fabianlim (Contributor) commented Jul 22, 2025

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing a test command.
  • The test results, such as pasting a before/after comparison of results, or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose

The test_mamba_ssm_ssd tests introduced in #10909 are currently problematic: they do not assert on failures, so the comparison results were never actually checked and the tests were failing silently (see the sketch after the list below).

  • To make the tests pass, the tolerance thresholds are adjusted in this PR, tested on H100.
  • The threshold for test_mamba_chunk_scan_cont_batch is now quite loose; the plan is to revisit it later.
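
As a minimal sketch of the failure mode (the tensor names are illustrative, not taken from the test):

    import torch

    a = torch.randn(4)
    b = a + 0.1  # perturb well beyond the tolerance used below

    # Old pattern: torch.allclose returns a bool; if that bool is never
    # asserted, a mismatch is silently ignored and the test "passes".
    torch.allclose(a, b, atol=1e-3, rtol=1e-3)  # returns False, discarded

    # New pattern: torch.testing.assert_close raises AssertionError on
    # mismatch, so the test genuinely fails when the tensors diverge.
    try:
        torch.testing.assert_close(a, b, atol=1e-3, rtol=1e-3)
    except AssertionError as e:
        print("mismatch detected:", e)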

cc: @tdoublep @tlrmchlsmth @cyang49

Test Plan

pytest tests/kernels/mamba/test_mamba_ssm_ssd.py 
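
If needed, a single test can be selected with pytest's -k filter; for example (invocation illustrative, not from the PR):

    pytest tests/kernels/mamba/test_mamba_ssm_ssd.py -k test_mamba_chunk_scan_cont_batch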

Test Result

(Optional) Documentation Update

@gemini-code-assist bot left a comment:

Code Review

This pull request correctly fixes an issue where tests were not asserting failures by replacing torch.allclose with torch.testing.assert_close. My main feedback is on the error tolerances, which have been significantly increased to make the tests pass. In one case, the tolerance is so high (50%) that it may render the test less effective at catching meaningful regressions. I've added comments suggesting ways to handle this, such as using conditional tolerances for different data types and adding TODOs to ensure they are tracked and addressed in the future.

@gemini-code-assist bot commented (severity: high):

The tolerances in this test have been increased significantly. While this might be necessary to make the test pass with float16 on newer hardware, it's a very loose check, especially for float32. A 5% deviation can mask meaningful numerical issues. Could you please check if float32 can pass with a tighter tolerance? If so, it would be best to use different tolerances based on the itype.

Suggested change:

    -torch.testing.assert_close(Y[:, -1], Y_min[:, -1], atol=5e-2, rtol=5e-2)
    +if itype == torch.float16:
    +    y_atol, y_rtol = 5e-2, 5e-2
    +    state_atol, state_rtol = 1e-2, 5e-2
    +else:
    +    # Use original, tighter tolerances for float32
    +    y_atol, y_rtol = 1e-3, 1e-3
    +    state_atol, state_rtol = 1e-3, 1e-3
    +torch.testing.assert_close(Y[:, -1], Y_min[:, -1], atol=y_atol, rtol=y_rtol)
    +torch.testing.assert_close(final_state[:, -1],
    +                           final_state_min[:, -1].to(torch.float32),
    +                           atol=state_atol,
    +                           rtol=state_rtol)

@fabianlim (Contributor Author): addressed in #21379

@gemini-code-assist bot commented (severity: high):

The new tolerance values atol=5e-1 and rtol=5e-1 are extremely high, allowing for a 50% relative difference, which may mask significant numerical discrepancies or bugs in the kernel implementation. Add a TODO comment in the code explaining why the tolerance is so high and that it needs to be investigated and tightened.

    # TODO: Investigate and tighten tolerance. Current values are masking potential numerical issues.
    torch.testing.assert_close(Y_eg, Y_min_eg, atol=5e-1, rtol=5e-1)
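
For reference, torch.testing.assert_close considers two values close when |actual - expected| <= atol + rtol * |expected|, so atol=5e-1 combined with rtol=5e-1 admits very large deviations. A small illustration (the values are hypothetical, not from the test):

    import torch

    # For expected = 1.0 the allowed band is atol + rtol * |expected|
    # = 0.5 + 0.5 * 1.0 = 1.0, i.e. any actual value in [0.0, 2.0] passes.
    torch.testing.assert_close(torch.tensor(1.9), torch.tensor(1.0),
                               atol=5e-1, rtol=5e-1)  # passes despite 90% error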

@fabianlim (Contributor Author): addressed in #21379

@github-actions bot commented:

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, which executes a small, essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
@fabianlim force-pushed the mamba-ssm-fix-tests branch from b41cacb to 2360bd1 on July 22, 2025 23:59
@DarkLight1337 (Member) commented:

cc @tdoublep

@tdoublep (Member) left a comment:

Some of the high tolerances need to be investigated as a follow-up, but having asserts with a high tolerance is better than having no asserts at all.

@DarkLight1337 added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Jul 23, 2025
@DarkLight1337 enabled auto-merge (squash) on July 23, 2025 07:52
@vllm-bot merged commit 32ec9e2 into vllm-project:main on Jul 23, 2025
43 of 44 checks passed
@fabianlim deleted the mamba-ssm-fix-tests branch on July 23, 2025 12:44
zixi-qi pushed a commit to zixi-qi/vllm that referenced this pull request Jul 23, 2025
Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
Signed-off-by: qizixi <qizixi@meta.com>
x22x22 pushed a commit to x22x22/vllm that referenced this pull request Aug 5, 2025
Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
Signed-off-by: x22x22 <wadeking@qq.com>
Pradyun92 pushed a commit to Pradyun92/vllm that referenced this pull request Aug 6, 2025
Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
npanpaliya pushed a commit to odh-on-pz/vllm-upstream that referenced this pull request Aug 6, 2025
Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
jinzhen-lin pushed a commit to jinzhen-lin/vllm that referenced this pull request Aug 9, 2025
Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
paulpak58 pushed a commit to paulpak58/vllm that referenced this pull request Aug 13, 2025
Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
Signed-off-by: Paul Pak <paulpak58@gmail.com>
diegocastanibm pushed a commit to diegocastanibm/vllm that referenced this pull request Aug 15, 2025
Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
Signed-off-by: Diego-Castan <diego.castan@ibm.com>
epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 28, 2025
Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>