Qualcomm AI Engine Direct - Support Hybrid Mode for Llama3.2 #7175
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/7175
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit db86fbd with merge base f370e78.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Hi @cccclai, I wonder if you have a plan to support reuse of processed bytes? It looks like the current approach only deduplicates processed bytes within a single method rather than across all methods (see executorch/exir/emit/_emitter.py, line 1066 at 8861b9a).
This might be critical to the .pte size, since we have different methods that all point to the same processed binary. Any suggestion for how we could work this out?
Hmm, I'm trying to follow - in #6657, I remember we observed a size reduction; how is it different from the test example added in the PR?
Yes, the context binary size will be smaller than the version without
Hopefully this makes sense to you.
Oh hmm, I thought this line would just deduplicate directly, unless they're not a bit-exact match. Did I misunderstand anything?
What is the
No, your understanding is correct. But I think the emitter state below is created per method (with an empty delegate_cache), so the deduplication only happens within a single method:

    # emit each entry point in order according to name.
    for name, exported_program in sorted(methods.items()):
        # create empty state
        emitter_state = _EmitterState(
            values=[],
            operators=[],
            delegates=[],
            operator_cache={},
            delegate_cache={},
            emit_stacktrace=emit_stacktrace,
        )

Our scenario for weight sharing has different methods all pointing to the same processed binary.
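For reference, here is a minimal, self-contained sketch (illustrative only, not the actual emitter code) of what program-level deduplication would look like: the cache is keyed by the processed bytes and lives at program scope, so a second method that lowers to the same context binary reuses the existing entry instead of appending a new one.

```python
# Illustrative sketch: dedup delegate payloads across methods by keeping the
# cache on program-level state instead of the per-method emitter state.
program_delegate_data: list[bytes] = []          # shared across all methods
program_delegate_cache: dict[bytes, int] = {}    # processed_bytes -> index

def emit_delegate(processed_bytes: bytes) -> int:
    """Return the index of the delegate payload, reusing identical bytes."""
    index = program_delegate_cache.get(processed_bytes)
    if index is None:
        index = len(program_delegate_data)
        program_delegate_data.append(processed_bytes)
        program_delegate_cache[processed_bytes] = index
    return index

# Two methods lowering to the same context binary share one entry:
assert emit_delegate(b"qnn-context") == emit_delegate(b"qnn-context")
```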
Hmm, I feel like something is off. If we look at the executorch .pte schema, the delegate data is outside of each method: https://github.com/pytorch/executorch/blob/main/exir/schema.py#L292
There is a chance we have a bug somewhere internally; I'd need to check.
This is the patch that works on my side:

diff --git a/backends/qualcomm/utils/utils.py b/backends/qualcomm/utils/utils.py
index 33be00ed5..fcb7eb667 100644
--- a/backends/qualcomm/utils/utils.py
+++ b/backends/qualcomm/utils/utils.py
@@ -733,7 +733,7 @@ def from_context_binary( # noqa: C901
bundle_prog = build_graph(inputs, outputs)
bundle_prog.update({"inputs": inputs, "outputs": outputs})
edge_prog_mgr = to_edge(
- programs={graph_name: bundle_prog["exported_program"]},
+ {graph_name: bundle_prog["exported_program"]},
# do not alter name for custom op
compile_config=EdgeCompileConfig(_use_edge_ops=False),
)
@@ -791,7 +791,7 @@ def generate_multi_graph_program(
]
# leverage ExecutorchProgramManager for generating pte with multi-methods
edge_prog_mgr = to_edge(
- programs={
+ {
graph_name: bundle_prog["exported_program"]
for graph_name, bundle_prog in zip(graph_names, bundle_progs)
},
diff --git a/exir/_serialize/_program.py b/exir/_serialize/_program.py
index 00a3d4700..26e28589c 100644
--- a/exir/_serialize/_program.py
+++ b/exir/_serialize/_program.py
@@ -254,6 +254,7 @@ def _extract_delegate_segments(
"""
remaining_inline: List[BackendDelegateInlineData] = []
inline_indices_seen: set[int] = set()
+ segment_index_map: dict[bytes, int] = {}
for plan in program.execution_plan:
for delegate in plan.delegates:
if delegate.processed.location != DataLocation.INLINE:
@@ -279,8 +280,12 @@ def _extract_delegate_segments(
inline_indices_seen.add(delegate.processed.index)
if inline.data:
# Move the delegate data out of the program.
- segment_index = len(segments)
- segments.append(Cord(inline.data))
+ segment_index = segment_index_map.get(inline.data)
+ if segment_index is None:
+ segment_index = len(segments)
+ segments.append(Cord(inline.data))
+ segment_index_map[inline.data] = segment_index
+
delegate.processed = BackendDelegateDataReference(
location=DataLocation.SEGMENT,
index=segment_index,
diff --git a/exir/emit/_emitter.py b/exir/emit/_emitter.py
index 381bab618..cf19c0479 100644
--- a/exir/emit/_emitter.py
+++ b/exir/emit/_emitter.py
@@ -119,6 +119,7 @@ class _ProgramState:
# Delegate data stored directly in the flatbuffer. Pointed to by BackendDelegateDataReference,
# and should be copied to Program.backend_delegate_data.
backend_delegate_data: List[BackendDelegateInlineData] = field(default_factory=list)
+ backend_delegate_data_cache: Dict[bytes, int] = field(default_factory=dict)
@dataclass
@@ -1049,10 +1050,13 @@ class _Emitter(torch.fx.Interpreter):
if delegate_index is None:
# Allocate an entry for the data. TODO(T150113674): Reuse any duplicate entries if
# present.
- data_index: int = len(self.program_state.backend_delegate_data)
- self.program_state.backend_delegate_data.append(
- BackendDelegateInlineData(data=processed_bytes)
- )
+ data_index = self.program_state.backend_delegate_data_cache.get(processed_bytes)
+ if data_index is None:
+ data_index: int = len(self.program_state.backend_delegate_data)
+ self.program_state.backend_delegate_data_cache[processed_bytes] = data_index
+ self.program_state.backend_delegate_data.append(
+ BackendDelegateInlineData(data=processed_bytes)
+ )
backend_delegate = BackendDelegate(
id=lowered_module.backend_id,

You could inspect the generated .pte by running:

python backends/qualcomm/tests/test_qnn_delegate.py -k TestQNNQuantizedOperators.test_qnn_backend_multi_graphs -s $DEVICE_SERIAL -m $SOC_MODEL -b build-android/ -a $PATH_TO_ARTIFACTS
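As a quick sanity check, here is a hedged sketch (how you obtain the emitted Program object depends on your export flow) that counts the entries in Program.backend_delegate_data; with deduplication working, methods lowered to the same context binary should contribute a single entry.

```python
# Sketch: count unique inline delegate payloads in an emitted Program.
# `program` is assumed to be the exir schema Program produced during export
# (the same object whose backend_delegate_data the patch above deduplicates).

def count_delegate_entries(program) -> tuple[int, int]:
    """Return (total entries, unique payloads) for inline delegate data."""
    blobs = [entry.data for entry in program.backend_delegate_data]
    return len(blobs), len(set(blobs))

# Hypothetical usage:
# total, unique = count_delegate_entries(program)
# print(f"{total} delegate entries, {unique} unique payloads")
```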
Hi Chen, just in case you still encounter problems during runtime, please feel free to let us know. We will try to assist you.
Hi @chunit-quic, thank you for offering. I'm preparing an internal design review tomorrow for supporting multi-methods, so that we don't need a workaround for weight sharing. I'll get back to this repro right away.
Hi @cccclai,
@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
I've verified the stories model PR, and batch prefill/decode standalone seems to be working well. Trying the 1B model.
In the meanwhile, it looks like we ran into some flatbuffers compilation issues in the internal build.
Is
If alignment is required, can we do something similar to line 392 in 4c43c86?
flatbuffers::FlatBufferBuilder builder_;
flatbuffers::Verifier::Options fb_opt_;
This one, flatbuffers::Verifier::Options, is also causing a build error; maybe the flatbuffers version is different...
I feel like internally we are pinned to https://github.com/google/flatbuffers/commits/338393f8/. Any chance we can limit ourselves to features available before this version? If not, we may need to add patches for flatbuffers::Verifier::Options and FlatBufferBuilder64.
I'll also try to bump the internal flatbuffers version... will provide more info after trying it out.
Drafted an internal diff with the flatbuffers version bump - will watch the CI signal and hopefully it can pass
Hi @cccclai,
Thanks for reviewing the PR. We tried to use the original 32-bit flatbuffer; however, for larger models, we encountered errors complaining that the 32-bit flatbuffer size is insufficient. The 64-bit flatbuffer allows the size to be larger than 2 GB.
Ah, did you wrap the context binary inside the flatbuffer? That's likely the reason. In executorch, we move the context binary outside of the flatbuffer so it's no longer limited by the 32-bit flatbuffer size; see #885 (comment).
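A minimal sketch of that layout idea, under assumed names and alignment (this is not the real executorch serialization code): the flatbuffer stays small and only references the blobs, while the raw context binaries are appended after it as aligned, deduplicated segments, so the 32-bit flatbuffer offset limit never applies to them.

```python
# Illustrative sketch only: append large delegate blobs as aligned segments
# after a small flatbuffer, instead of embedding them inside it.
SEGMENT_ALIGNMENT = 16  # assumed alignment for this example

def pad_to(data: bytes, alignment: int) -> bytes:
    """Pad data with zeros so the next segment starts aligned."""
    remainder = len(data) % alignment
    return data + b"\x00" * ((alignment - remainder) % alignment)

def pack_program(flatbuffer: bytes, delegate_blobs: list[bytes]) -> tuple[bytes, list[int]]:
    """Lay out [flatbuffer][segment 0][segment 1]... with deduplication."""
    out = bytearray(pad_to(flatbuffer, SEGMENT_ALIGNMENT))
    seen: dict[bytes, int] = {}   # blob -> segment offset, for dedup
    offsets: list[int] = []
    for blob in delegate_blobs:
        if blob in seen:
            offsets.append(seen[blob])   # reuse the identical segment
            continue
        offset = len(out)
        seen[blob] = offset
        offsets.append(offset)
        out += pad_to(blob, SEGMENT_ALIGNMENT)
    # In the real format the offsets/sizes would live in a segment table
    # inside the flatbuffer; here we just return them alongside the bytes.
    return bytes(out), offsets
```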
The CI seems fine with the flatbuffers version bump... I'll try to get it landed.
Yes, we tried wrapping the context binary inside the flatbuffer. If the version bump does not work, we could try moving it outside of the flatbuffer.
Actually, the CI is really bad when I bump the flatbuffers version... we tried to bump the flatbuffers version before and it didn't go well. I guess we can try again, but no guarantees.
I believe force alignment is not required for our case.
645fb63 to 43dfe13 (Compare)
I was able to repro the following numbers for input=31,
but the model size is
Thank you for providing the numbers. Regarding the model size, I believe @haowhsu-quic mentioned this at the top of the PR. While the context binary is indeed smaller, the PTE size is not. It would be appreciated if you could review the patch above. If you think it works, we can create another PR for it.
I'm using QNN 2.26. In the meanwhile, I think we can apply the patch above.
I actually tested the patch and it looks correct to me, though I'm still not sure why the .pte size doubles...
Summary: Reported by pytorch#7175 that the delegate is not deduplicated when they're exactly the same. Differential Revision: D67067997
@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
The CI looks clean now! A separate question: does the quantized_linear in qnn use the quantized weight from prefill or from decode?
If you are referring to the weights for the linear op, I believe that prefill and decode should have the same quantized weights, which means it should be using weight sharing.
Yeah, it's for the linear op. However, prefill and decode are calibrated differently, meaning the quantized weights may not be the same?
I believe the weights' scale/offset should be the same, and that's why we can achieve almost half the .pte size when enabling weight sharing. The fp values of the weights are always the same for both modes, so we should always get the same scale/offset for both modes. On the other hand, the activations' scale/offset will be different due to the different inputs during calibration.
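To illustrate the point with a generic min/max PTQ toy (not the actual QNN quantizer): weight scale/offset are a pure function of the weight tensor, so identical fp weights give bit-identical parameters, while activation scale/offset depend on the calibration inputs. The tensors and bit-width below are made up for the example.

```python
import torch

def minmax_qparams(t: torch.Tensor, qmin: int = -8, qmax: int = 7):
    """Toy asymmetric min/max quantization parameters (4-bit range here)."""
    lo, hi = t.min().item(), t.max().item()
    scale = (hi - lo) / (qmax - qmin) if hi > lo else 1.0
    offset = round(qmin - lo / scale)
    return scale, offset

weight = torch.randn(128, 128)            # shared fp weight for both modes
prefill_acts = torch.randn(32, 128)       # calibration data for prefill
decode_acts = torch.randn(1, 128) * 3.0   # calibration data for decode

# Weight qparams depend only on the weight itself -> identical in both modes.
assert minmax_qparams(weight) == minmax_qparams(weight)

# Activation qparams depend on the calibration inputs -> generally differ.
print(minmax_qparams(prefill_acts), minmax_qparams(decode_acts))
```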
Just another separate question: we are still unable to achieve great accuracy for hybrid mode by applying PTQ. We are looking forward to enabling QAT on static llama, and we would like to know if there are currently any barriers or difficulties that you are facing. Thanks.
Current results that we are getting:
"what is 1+1<|eot_id|><|start_header_id|>assistant<|end_header_id|> There is one or one!<|eot_id|>"
Yeah, we've done some work on QAT. By eyeballing, the result is reasonable. Currently we QAT the weights and calibrate prefill then decode separately; however, the model size increases and we're trying to figure out if it's due to the quantized weight difference.
I would like to confirm whether you can reproduce the model size reduction with the patch @haowhsu-quic shared in this thread. If applied correctly on this PR, we should see a size reduction in the 16a4w 1B hybrid mode, from approximately 2.2 GB to 1.1 GB. Additionally, I would like to confirm that QAT is performed only once and that the fp weights (before calibration) for prefill and kv mode are the same. If the fp weights are the same, the calibration process should generate exactly the same scale/offset, leading to a significant size reduction through weight sharing. Thanks.
With this PR, yes, I observe the smaller size, down to 1.1 GB. But after we apply QAT + PTQ, the size seems to be larger, up to 1.6 GB.
The fp weights will be the same; however, they will be calibrated differently (calibrate prefill and calibrate decode), meaning the quantization parameters (the scale and zero point) might be slightly different.
I think by calibrating the data with only PTQ, we do get the same scale/offset for the weights, which aligns with the size reduction from 2.2 GB to 1.1 GB.
QAT is only performed once, and PTQ is performed twice, once with the prefill model and once with the decode model.
How long does it currently take to export a model? I have been waiting for 2 hours...
For my case, where I use a 512 kv and 32 prefill sequence length, I think it took about 3 hours. The time required to export the model can vary depending on the prefill/kv sequence lengths provided.
For the issue where some weights have a slight difference, would you mind providing a draft PR so we can try to reproduce the issue and see if it can be resolved? We can use dummy inputs to test the scale/offset.
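One way to check this, as a hedged sketch (prefill_model and decode_model are hypothetical handles to the two calibrated modules, not names from this PR): diff their state dicts entry by entry; any tensor that is not bit-exact would make the emitted delegate payloads differ and break sharing.

```python
import torch

def diff_state_dicts(a: torch.nn.Module, b: torch.nn.Module, atol: float = 0.0):
    """Report parameters/buffers (incl. scale/zero-point buffers) that differ."""
    sd_a, sd_b = a.state_dict(), b.state_dict()
    mismatched = []
    for name in sorted(set(sd_a) & set(sd_b)):
        ta, tb = sd_a[name], sd_b[name]
        if ta.shape != tb.shape or not torch.allclose(
            ta.float(), tb.float(), atol=atol, rtol=0.0
        ):
            mismatched.append(name)
    only_a = sorted(set(sd_a) - set(sd_b))
    only_b = sorted(set(sd_b) - set(sd_a))
    return mismatched, only_a, only_b

# Hypothetical usage: prefill_model / decode_model are the two calibrated
# modules; any mismatch here means the lowered payloads are not bit-exact.
# mismatched, only_prefill, only_decode = diff_state_dicts(prefill_model, decode_model)
# print(mismatched[:20])
```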
Yeah, let me see how to repro... Additionally, it seems like the model is bigger with the new commit. Without applying #7281, the model bumps to 4 GB; after applying it, it's 2 GB.
Thanks!
I'm currently getting the following data with #7281:
This is the only change we made for calibration.
I have applied the calibration patch you provided above and #7281.
@winskuo-quic yes, that will be very helpful. In the meanwhile, it looks like the decoding speed of the hybrid model is a bit slower than the previous decode speed of kv mode. Do we know the reason?
Great!
Summary: Reported by pytorch#7175 that the delegate is not deduplicated when they're exactly the same. Reviewed By: tarun292. Differential Revision: D67067997
Summary
Script to run with a prefill context_length of 32 and a kv context_length of 512:
python examples/qualcomm/oss_scripts/llama3_2/llama.py -a ${ARCHIVE} -b build-android -H ${HOST} -m ${SOC} --checkpoint Llama3.2-1B-Instruct/consolidated.00.pth --params Llama3.2-1B-Instruct/params.json --tokenizer_model Llama3.2-1B-Instruct/tokenizer.model --prompt "what is 1+1" --temperature 0 --model_size 1B --model_mode hybrid --prefill_seq_len 32 --kv_seq_len 512 --ptq 16a4w