
Conversation

@luccafong
Collaborator

@luccafong luccafong commented Feb 4, 2025

Implements DeepSeek MTP (#12181), supporting DeepSeek MTP layers for next-n token prediction.

Online Serving
Add --num-speculative-tokens 1 for DeepSeek V3/R1:

python -m vllm.entrypoints.openai.api_server --disable-log-requests --gpu-memory-utilization 0.8 --max-model-len 65536 --max-num-seqs 128 --seed 0 --tensor-parallel-size 8 --model deepseek-ai/DeepSeek-R1 --trust-remote-code --num-speculative-tokens 1
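Once the server is up, requests go through the standard OpenAI-compatible API; speculative decoding is transparent to the client. A minimal sketch, assuming the default port 8000 and a placeholder prompt (neither is specified by the command above):

```python
# Minimal client sketch; the port (8000 by default) and api_key value are
# assumptions, since the serve command above does not set them explicitly.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.completions.create(
    model="deepseek-ai/DeepSeek-R1",
    prompt="Explain speculative decoding in one sentence.",
    max_tokens=64,
)
print(resp.choices[0].text)
```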

Offline Inference
Set num_speculative_tokens = 1

llm = LLM(
    model="deepseek-ai/DeepSeek-R1",
    tensor_parallel_size=8,
    max_model_len=8192, # If you have enough memory with your hardware, you can ignore this
    num_speculative_tokens=1, # only 1 is supported for now
    draft_tensor_parallel_size=8, # optional, by default it will be the same as tensor_parallel_size
)

Note: this implementation has been validated only with MTP k=1 models.
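For reference, a minimal generation sketch against the config above (the prompt and sampling settings are placeholders):

```python
from vllm import SamplingParams

# Uses the `llm` object constructed above. Speculative decoding is transparent
# at the API level, so generate() is called exactly as without MTP.
outputs = llm.generate(
    ["Explain multi-token prediction in one sentence."],  # placeholder prompt
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```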

Benchmark Results

The acceptance rate is 81%–82.3% on R1 with k=1.
The speedup depends on QPS, with a 1.63x speedup at QPS=1 and improvements at QPS < 8, as shown in the tables below.

Results on various QPS

Draft TP=1

| QPS | Baseline TPOT | k=1 TPOT | Speedup |
|-----|---------------|----------|---------|
| 1 | 55.47 | 33.99 | 1.63x |
| 2 | 57.58 | 48.8 | 1.18x |
| 4 | 64.29 | 51.02 | 1.26x |
| 6 | 122.93 | 108.15 | 1.14x |
| 8 | 120.18 | 119.14 | 1.0x |

Draft TP=8

| QPS | Baseline TPOT | k=1, TP=8 TPOT | Speedup |
|-----|---------------|----------------|---------|
| 1 | 55.47 | 32.64 | 1.69x |
| 2 | 57.58 | 43.6 | 1.32x |
| 4 | 64.29 | 52.62 | 1.22x |
| 6 | 122.93 | 129.5 | < 1.0 |
| 8 | 120.18 | 139.49 | < 1.0 |

Results on various Concurrency
Draft TP=8

| MAX_CONCURRENCY | Baseline TPOT | k=1 TPOT | Speedup |
|-----------------|---------------|----------|---------|
| 1 | 23.13 | 17.24 | 1.34x |
| 2 | 28.10 | 17.07 | 1.64x |
| 4 | 27.57 | 21.48 | 1.28x |
| 8 | 38.57 | 34.62 | 1.11x |
| 16 | 50.24 | 40.89 | 1.22x |
| 32 | 70.88 | 56.63 | 1.25x |

@github-actions

github-actions bot commented Feb 4, 2025

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

Collaborator

@comaniac comaniac left a comment


Otherwise LGTM. It's pretty clean so no concerns.

Collaborator


Do you know how long it takes to run all tests in this file?

Collaborator


QQ: where did we truncate the input_ids?

Collaborator Author


For the 1st stage, position 0 is masked for MTP, but that only applies to k=1; I need to change the mask to positions <= k-1.
For the 2+ stages, the previous tokens from the last stage are marked pre-computed, which is a bit complicated for k>1 across different layers; I need to look into it.

In short, the current change works for k=1 (which is what the DeepSeek V3 model sets), but more changes are needed for k>1.
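To illustrate the masking being described (a sketch of the idea, not the PR's code): at MTP spec step s, positions <= s have no valid previous hidden state, so their input embeddings are zeroed; with k=1 only position 0 is masked.

```python
import torch

# Sketch of the masking described above (not the PR's exact code): at spec step
# `spec_step_index` (0-based), positions <= spec_step_index are zeroed out
# because they have no valid previous hidden state to pair with.
def mask_mtp_inputs(inputs_embeds: torch.Tensor,
                    positions: torch.Tensor,
                    spec_step_index: int) -> torch.Tensor:
    inputs_embeds = inputs_embeds.clone()
    inputs_embeds[positions <= spec_step_index] = 0
    return inputs_embeds

# k=1 case (spec_step_index=0): only position 0 is masked, matching the current
# change; larger k would mask positions <= k-1 on the later steps.
embeds = torch.randn(4, 8)
positions = torch.arange(4)
masked = mask_mtp_inputs(embeds, positions, spec_step_index=0)
assert torch.all(masked[0] == 0)
```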

Collaborator


QQ: what's the shape of input_ids here?

Collaborator Author


The incremental length; here it's [B*1].


@Neo9061 Neo9061 left a comment


Is there any way to add an MD file with examples of how to use MTP for SD?

Especially:

  1. num_nextn_predict_layers is 1; can we specify a speculation length greater than 1? And what are the requirements for formatting the draft model artifacts?
  2. Is this code compatible with multi-node inference? I assume so, since the draft is loaded on a single GPU?



num_nextn_predict_layers in DeepSeek V3 is only 1. Does that mean you will reuse the MTP head if I specify MAX_SEC_TOKENS greater than 1?

Collaborator Author


This is a test file on a dummy model. num_speculative_tokens should be <= num_nextn_predict_layers, since the transformer blocks differ between steps. I am adding an assertion for the case where the user passes a higher number.
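A minimal sketch of the kind of assertion described (names mirror the DeepSeek config, but this is illustrative, not the PR's exact code):

```python
# Illustrative validation sketch, not the PR's exact code.
def validate_speculative_tokens(num_speculative_tokens: int,
                                num_nextn_predict_layers: int) -> None:
    # Each spec step runs its own MTP transformer block, so we cannot speculate
    # more tokens than there are MTP layers in the checkpoint.
    if num_speculative_tokens > num_nextn_predict_layers:
        raise ValueError(
            f"num_speculative_tokens ({num_speculative_tokens}) must be <= "
            f"num_nextn_predict_layers ({num_nextn_predict_layers})")

validate_speculative_tokens(1, 1)  # OK for DeepSeek V3/R1 (one MTP layer)
```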



Is there a way to just reuse the MTP module to predict tokens for k > 1? Essentially they are the same, right?

You could print a warning that this is not expected, though.



Shouldn't mtp_start_layer_idx be num_hidden_layers - 1?

num_hidden_layers is 61 in the DeepSeek config, and the index of the last layer is 60.

Collaborator Author


Per https://huggingface.co/deepseek-ai/DeepSeek-V3/raw/main/model.safetensors.index.json, the last layer is 61, which is the MTP layer.

I see, thanks for clarifying!
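In other words, the main model occupies layer indices 0 through num_hidden_layers - 1, and the MTP layers start immediately after it. A small indexing sketch (illustrative, not the PR's code):

```python
# Illustrative indexing sketch, not the PR's code.
num_hidden_layers = 61        # DeepSeek V3/R1 main-model layers: indices 0..60
num_nextn_predict_layers = 1  # one MTP layer in the released checkpoint

mtp_start_layer_idx = num_hidden_layers  # first MTP layer lives at index 61
mtp_layer_indices = list(range(mtp_start_layer_idx,
                               mtp_start_layer_idx + num_nextn_predict_layers))
assert mtp_layer_indices == [61]  # matches "model.layers.61.*" in the safetensors index
```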

@Neo9061

Neo9061 commented Feb 5, 2025

@luccafong Sorry to have to ask these questions, as I hope to use your implementation.

  1. Have you tested it e2e with vLLM's multi-node distributed inference setting? Asking since I can only deploy the model in multi-node settings.
  2. If I want to reuse the MTP head for a speculation length k > 1, what hacky implementation would you recommend to just make it work? k=1 is too limited for my application.

@benchislett
Collaborator

@luccafong I have been working on a similar implementation locally, and have faced a few challenges that I'm not sure are addressed here. Have you validated the acceptance rate for k=1 for real weights?

I believe that the final RMSNorm in the DeepSeekV3 main model is not necessary for speculative decoding since the hnorm already normalizes the previous hidden weights received from the main model. It's unclear to me how it is classified in the DeepSeek-V3 technical report, but I think that the norm might be included in the output head and therefore not normalized as input to the MTP module. Anecdotally, I observe a small increase in acceptance rate with this change.

Also, I have noticed the acceptance rate becomes very low (<50%) when I enable the recently added MLA attention. Have you noticed this also? I am not sure what could cause this, maybe it is a bug fixed in recent commits to vLLM. I would like to know if this is an issue for your implementation.
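For context, a schematic of the MTP input path under discussion, pieced together from the hnorm/eh_proj snippet quoted later in this thread; it is a sketch of the idea (module and attribute names assumed), not the PR's module:

```python
import torch
import torch.nn as nn

# Schematic sketch of the MTP input projection discussed above (names assumed,
# not the PR's module): the previous hidden states from the main model pass
# through hnorm, the token embeddings through enorm, and eh_proj fuses them
# before the MTP transformer block. Whether the main model's final RMSNorm
# should also be applied before hnorm is the open question in this comment.
class MTPInputProj(nn.Module):
    def __init__(self, hidden_size: int):
        super().__init__()
        self.enorm = nn.RMSNorm(hidden_size)
        self.hnorm = nn.RMSNorm(hidden_size)
        self.eh_proj = nn.Linear(2 * hidden_size, hidden_size, bias=False)

    def forward(self, inputs_embeds: torch.Tensor,
                previous_hidden_states: torch.Tensor) -> torch.Tensor:
        inputs_embeds = self.enorm(inputs_embeds)
        previous_hidden_states = self.hnorm(previous_hidden_states)
        return self.eh_proj(
            torch.cat([inputs_embeds, previous_hidden_states], dim=-1))
```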

@luccafong
Collaborator Author

> @luccafong I have been working on a similar implementation locally, and have faced a few challenges that I'm not sure are addressed here. Have you validated the acceptance rate for k=1 for real weights?
>
> I believe that the final RMSNorm in the DeepSeekV3 main model is not necessary for speculative decoding since the hnorm already normalizes the previous hidden weights received from the main model. It's unclear to me how it is classified in the DeepSeek-V3 technical report, but I think that the norm might be included in the output head and therefore not normalized as input to the MTP module. Anecdotally, I observe a small increase in acceptance rate with this change.
>
> Also, I have noticed the acceptance rate becomes very low (<50%) when I enable the recently added MLA attention. Have you noticed this also? I am not sure what could cause this, maybe it is a bug fixed in recent commits to vLLM. I would like to know if this is an issue for your implementation.

The accept rate was around 56% during my testing; MLA attention could lead to a different branch (https://github.com/luccafong/vllm/blob/ds_mtp/vllm/spec_decode/multi_step_worker.py#L98), which I fixed in a later commit.

Regarding the norm, thanks for pointing it out; let me try adjusting it to see if there is an improvement.

@luccafong
Collaborator Author

luccafong commented Feb 6, 2025

> @luccafong Sorry to have to ask these questions, as I hope to use your implementation.
>
> 1. Have you tested it e2e with vLLM's multi-node distributed inference setting? Asking since I can only deploy the model in multi-node settings.
> 2. If I want to reuse the MTP head for a speculation length k > 1, what hacky implementation would you recommend to just make it work? k=1 is too limited for my application.

1. Not tested with multi-node settings. 2. We can reuse it if you do some model processing, e.g. duplicating the weights to different layers; a hacky change would not be proper, because when the number of predict layers is > 1 and we use k > the number of predict layers, it is difficult to decide which layer to forward multiple times.
Note that, as commented in the other thread, some changes are needed for k>1; this is work in progress, and I will update you if it works.
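A rough sketch of the kind of model processing mentioned above, i.e. duplicating the single MTP layer's weights to extra layer indices so that more spec steps have a block to load. This is a hypothetical illustration only (the shard file name is a placeholder; the "model.layers.61." prefix follows the DeepSeek V3 checkpoint layout, and the config's num_nextn_predict_layers would also need to be raised to match):

```python
# Hypothetical illustration of duplicating the MTP layer (index 61) to extra
# layer indices; not a supported or recommended workflow. The file name is a
# placeholder, and the config's num_nextn_predict_layers must be updated too.
from safetensors.torch import load_file, save_file

state = load_file("mtp_shard.safetensors")  # placeholder shard containing layer 61
extra_copies = 1                            # e.g. to allow k = 2

new_state = dict(state)
for i in range(1, extra_copies + 1):
    for key, tensor in state.items():
        if key.startswith("model.layers.61."):
            new_key = key.replace("model.layers.61.", f"model.layers.{61 + i}.")
            new_state[new_key] = tensor.clone()

save_file(new_state, "mtp_shard_duplicated.safetensors")
```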

@mergify

mergify bot commented Feb 6, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @luccafong.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Feb 6, 2025

Please excuse my multiple questions.

inputs_embeds[positions <= spec_step_index] = 0 is for the prefill stage of each MTP head, correct? During the draft model (MTP head) decoding stage, inputs_embeds is a single hidden vector.

That is what I saw in the EAGLE workflow. It first enters the code from here with num_steps being 1 for prefill (that is where the mask is effective). Then num_steps becomes the speculation length k, and inputs_embeds for each forward pass is a single embedding vector.

But I didn't see your logic modified in

for step in range(num_steps):

to introduce spec_step_index, so where do you introduce it?

@yangchou19

@luccafong Hi, thanks for your great work! I ran deepseek-r1 on a 2 x 8 H100 Ray cluster but encountered CUDA error: invalid device ordinal. Could you help take a look at this issue? The Ray status is normal, thanks.

Run script:

vllm serve deepseek-ai/DeepSeek-R1 \
        --host 0.0.0.0 \
        --port 8081 \
        --tensor-parallel-size 16 \
        --pipeline-parallel-size 1 \
        --gpu-memory-utilization 0.8 \
        --max-num-seqs 16 \
        --max-model-len 32768 \
        --served-model-name deepseek_r1 \
        --device cuda \
        --quantization fp8 \
        --trust-remote-code \
        --num-speculative-tokens 1

Error message:

ERROR 02-19 05:03:47 engine.py:389]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-19 05:03:47 engine.py:389]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-19 05:03:47 engine.py:389]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 582, in execute_method
ERROR 02-19 05:03:47 engine.py:389]     raise e
ERROR 02-19 05:03:47 engine.py:389]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 573, in execute_method
ERROR 02-19 05:03:47 engine.py:389]     return run_method(target, method, args, kwargs)
ERROR 02-19 05:03:47 engine.py:389]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-19 05:03:47 engine.py:389]   File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2196, in run_method
ERROR 02-19 05:03:47 engine.py:389]     return func(*args, **kwargs)
ERROR 02-19 05:03:47 engine.py:389]            ^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-19 05:03:47 engine.py:389]   File "/usr/local/lib/python3.12/dist-packages/vllm/spec_decode/spec_decode_worker.py", line 368, in init_device
ERROR 02-19 05:03:47 engine.py:389]     self.spec_decode_sampler.init_tensors(self.rank,
ERROR 02-19 05:03:47 engine.py:389]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/spec_decode_base_sampler.py", line 56, in init_tensors
ERROR 02-19 05:03:47 engine.py:389]     self.num_accepted_tokens = torch.tensor(0,
ERROR 02-19 05:03:47 engine.py:389]                                ^^^^^^^^^^^^^^^^
ERROR 02-19 05:03:47 engine.py:389] RuntimeError: CUDA error: invalid device ordinal
ERROR 02-19 05:03:47 engine.py:389] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
ERROR 02-19 05:03:47 engine.py:389] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
ERROR 02-19 05:03:47 engine.py:389] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

I encountered an “out of memory” error while running DeepSeek-R1 on 2 machines with 8 H20 GPUs each. Has anyone successfully run DeepSeek-R1 on H20 GPUs?

Here is the command I used:

vllm serve deepseek-ai/DeepSeek-R1 \
        --host 0.0.0.0 \
        --port 8081 \
        --tensor-parallel-size 16 \
        --pipeline-parallel-size 1 \
        --gpu-memory-utilization 0.99 \
        --max-model-len 131072 \
        --served-model-name deepseek_r1 \
        --quantization fp8 \
        --trust-remote-code \
        --num-speculative-tokens 1

@benchislett
Collaborator

@yangchou19 Your GPU memory utilization is set too high. As I understand it, the weights for speculative decoding are not currently accounted for in the memory profiler, so there must be additional memory left over from the vLLM allocation for them to live in. As a workaround, try decreasing the GPU memory utilization progressively (try 0.95, 0.9, 0.88, 0.85) until it succeeds. If you want to use speculative decoding, you may need to decrease max-model-len to ensure there is enough memory available for a large KV cache.

@hxt365

hxt365 commented Feb 27, 2025

Is this applicable to R1 Distill models? I got this error for deepseek-ai/DeepSeek-R1-Distill-Qwen-32B: ValueError: num_speculative_tokens was provided without speculative_model

@mgoin
Member

mgoin commented Feb 27, 2025

@hxt365 No, the Distill models do not have MTP modules like DeepSeek V3/R1.

@KiroSummer

Great work on the analysis! I wanted to clarify one point regarding the baseline vs. MTP performance comparison. Given that the speculative decoding worker implementation doesn't support asynchronous output processing or multi-step scheduling, could you confirm whether these two optimizations were utilized when calculating the TPOT metric for the baseline model?

@JoeyYoung

Hi all, I tried to use speculative decoding with tp=8 and pp=2 on a 2 x 8 H20 testbed, with the following command:

vllm serve /vllm-workspace/DeepSeek-R1/ \
        --host 0.0.0.0 \
        --port 8081 \
        --tensor-parallel-size 8 \
        --pipeline-parallel-size 2 \
        --gpu-memory-utilization 0.8 \
        --max-num-seqs 16 \
        --max-model-len 32768 \
        --served-model-name deepseek_r1 \
        --device cuda \
        --quantization fp8 \
        --trust-remote-code \
        --num-speculative-tokens 1

But it reports the error:

INFO 02-28 00:11:01 [config.py:334] Overriding HF config with <function SpeculativeConfig.hf_config_override at 0x7f1bcd2a7880>
Traceback (most recent call last):
  File "/root/anaconda3/envs/vllm/bin/vllm", line 8, in <module>
    sys.exit(main())
  File "/vllm-workspace/vllm/vllm/entrypoints/cli/main.py", line 73, in main
    args.dispatch_function(args)
  File "/vllm-workspace/vllm/vllm/entrypoints/cli/serve.py", line 34, in cmd
    uvloop.run(run_server(args))
  File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/uvloop/__init__.py", line 82, in run
    return loop.run_until_complete(wrapper())
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/root/anaconda3/envs/vllm/lib/python3.10/site-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
  File "/vllm-workspace/vllm/vllm/entrypoints/openai/api_server.py", line 946, in run_server
    async with build_async_engine_client(args) as engine_client:
  File "/root/anaconda3/envs/vllm/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/vllm-workspace/vllm/vllm/entrypoints/openai/api_server.py", line 138, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/root/anaconda3/envs/vllm/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/vllm-workspace/vllm/vllm/entrypoints/openai/api_server.py", line 162, in build_async_engine_client_from_engine_args
    engine_client = AsyncLLMEngine.from_engine_args(
  File "/vllm-workspace/vllm/vllm/engine/async_llm_engine.py", line 639, in from_engine_args
    engine_config = engine_args.create_engine_config(usage_context)
  File "/vllm-workspace/vllm/vllm/engine/arg_utils.py", line 1237, in create_engine_config
    speculative_config = SpeculativeConfig.maybe_create_spec_config(
  File "/vllm-workspace/vllm/vllm/config.py", line 2025, in maybe_create_spec_config
    return SpeculativeConfig(
  File "/vllm-workspace/vllm/vllm/config.py", line 2199, in __init__
    self._verify_args()
  File "/vllm-workspace/vllm/vllm/config.py", line 2207, in _verify_args
    self.draft_model_config.verify_with_parallel_config(
  File "/vllm-workspace/vllm/vllm/config.py", line 762, in verify_with_parallel_config
    raise NotImplementedError(
NotImplementedError: Pipeline parallelism is not supported for this model. Supported models implement the `SupportsPP` interface.

Is pipeline parallelism not supported for the draft model?

@BoyuanS

BoyuanS commented Mar 6, 2025


When I start deepseek-r1-awq int4 with --num-speculative-tokens 1, the generated contents are all empty after . Any idea why this happens? The model works well when num-speculative-tokens is not set.

previous_hidden_states = self.hnorm(previous_hidden_states)

hidden_states = self.eh_proj(
    torch.cat([inputs_embeds, previous_hidden_states], dim=-1))

@Pokemons386 Pokemons386 Mar 6, 2025


In the 4-token prefill case, the main model forwards token[0:3] and gets HS[0:3] and token4. Then MTP forwards token[0:4] (token0 masked); how are token[0:4] and HS[0:3] combined with torch.cat()?

Collaborator


See prepare_prefill_hidden_states, which rotates the hidden states such that they match up with the tokens.
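A toy sketch of that alignment (illustrative only, not the PR's exact code): during prefill the hidden state produced at position i belongs with the token at position i+1, so the hidden states are rolled by one position before being concatenated with the token embeddings.

```python
import torch

# Illustrative sketch of the alignment done by prepare_prefill_hidden_states
# (not the PR's exact code): HS[i] from the main model corresponds to the token
# it predicted, i.e. token[i+1], so the hidden states are rolled by one slot
# before torch.cat with the (masked) token embeddings.
def align_prefill_hidden_states(hidden_states: torch.Tensor) -> torch.Tensor:
    # hidden_states: [num_tokens, hidden_size] for one prefill sequence
    return torch.roll(hidden_states, shifts=1, dims=0)

hs = torch.arange(4).float().unsqueeze(-1)   # stand-ins for HS[0..3]
aligned = align_prefill_hidden_states(hs)
print(aligned.squeeze(-1))  # tensor([3., 0., 1., 2.]): token[1] pairs with HS[0], etc.
```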

@benchislett
Collaborator

@BoyuanS It is likely that your AWQ quantization did not include weights for the MTP head, in which case this will not work.
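One quick way to check whether a quantized checkpoint ships the MTP head at all is to look for the MTP layer's keys ("model.layers.61." for DeepSeek V3/R1) in the safetensors index. A hedged sketch, with the checkpoint path as a placeholder:

```python
import json

# Hedged sketch: scan a local checkpoint's safetensors index for MTP-layer
# weights. "model.layers.61." is the MTP layer prefix for DeepSeek V3/R1; the
# path below is a placeholder for your model directory.
with open("/path/to/checkpoint/model.safetensors.index.json") as f:
    weight_map = json.load(f)["weight_map"]

mtp_keys = [k for k in weight_map if k.startswith("model.layers.61.")]
if mtp_keys:
    print(f"Found {len(mtp_keys)} MTP-layer tensors, e.g. {mtp_keys[0]}")
else:
    print("No MTP-layer weights found; MTP speculative decoding will not work.")
```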

@parambole

Hey @luccafong @mgoin @LiuXiaoxuanPKU I am currently working on integrating Deepseek's Multi-Token Prediction into Maxtext.

Question:

As part of this PR, has the team been able to load the open MTP DeepSeek V3 weights and analyze the implementation during pre-training and fine-tuning? I am curious about the observed behavior.

@mgoin
Member

mgoin commented Jul 9, 2025

I have no experience with training, sorry!

@parambole

parambole commented Jul 10, 2025

> I have no experience with training, sorry!

Hey @mgoin, thanks for responding. With respect to inference, how is the performance of the predicted tokens when loading the published DeepSeek MTP weights?

Specifically, I am observing deepseek-ai/DeepSeek-V3#928.
