[VLM] Calculate maximum number of multi-modal tokens by model #6121
@@ -321,6 +321,17 @@ def get_phi3v_image_feature_size(
             + (new_height // 336 + 1) * 12


+def get_max_phi3v_image_tokens(ctx: InputContext):
+    # Result in the max possible feature size (h:w = 16:1)
+    dummy_height, dummy_width = 8000, 50
+
+    return get_phi3v_image_feature_size(
+        ctx.get_hf_config(PretrainedConfig),
+        input_height=dummy_height,
+        input_width=dummy_width,
+    )
+
+
 def dummy_data_for_phi3v(ctx: InputContext, seq_len: int):
     # Result in the max possible feature size (h:w = 16:1)
     dummy_height, dummy_width = 8000, 50

Review comment on the added get_max_phi3v_image_tokens function:

I've made #6146 to address this.
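The hard-coded worst-case shape above can be sanity-checked with a small stand-alone sketch. Note that image_feature_size below is a hypothetical stand-in for get_phi3v_image_feature_size, not vLLM's actual formula; only the `(new_height // 336 + 1) * 12` separator term is taken from the diff, and the 144-tokens-per-tile figure is an assumption.

```python
# Illustrative stand-in for get_phi3v_image_feature_size. Only the
# "(height // 336 + 1) * 12" separator term comes from the diff above;
# the rest of the formula is an assumption for illustration.
def image_feature_size(height: int, width: int, tile: int = 336) -> int:
    h_tiles = height // tile + 1  # rows of 336x336 tiles (padded)
    w_tiles = width // tile + 1   # columns of tiles
    # Assume 144 patch tokens per tile, plus 12 separator tokens per tile row.
    return h_tiles * w_tiles * 144 + h_tiles * 12


def get_max_image_tokens() -> int:
    # Mirror the PR: evaluate the feature size at the extreme 16:1
    # aspect ratio (8000 x 50) used as the worst case.
    dummy_height, dummy_width = 8000, 50
    return image_feature_size(dummy_height, dummy_width)
```

With this stand-in formula, the 8000 x 50 shape yields 24 tile rows and a single tile column, so get_max_image_tokens() returns 3744.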
@@ -429,6 +440,7 @@ def input_processor_for_phi3v(ctx: InputContext, llm_inputs: LLMInputs):


 @MULTIMODAL_REGISTRY.register_image_input_mapper()
+@MULTIMODAL_REGISTRY.register_max_image_tokens(get_max_phi3v_image_tokens)
 @INPUT_REGISTRY.register_dummy_data(dummy_data_for_phi3v)
 @INPUT_REGISTRY.register_input_processor(input_processor_for_phi3v)
 class Phi3VForCausalLM(nn.Module, SupportsVision):
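The decorator stack above registers per-model callbacks in a global registry. A toy sketch of that pattern (a hypothetical illustration, not vLLM's actual MULTIMODAL_REGISTRY implementation) looks like this:

```python
from typing import Callable, Dict, Type


class ToyMultiModalRegistry:
    """Toy registry illustrating the class-decorator pattern in the diff
    above. Hypothetical sketch, not vLLM's MULTIMODAL_REGISTRY."""

    def __init__(self) -> None:
        self._max_image_tokens: Dict[Type, Callable] = {}

    def register_max_image_tokens(self, fn: Callable):
        # Returns a class decorator that records the per-model callback.
        def wrapper(model_cls: Type) -> Type:
            self._max_image_tokens[model_cls] = fn
            return model_cls  # the decorated class itself is unchanged
        return wrapper

    def get_max_image_tokens(self, model_cls: Type, ctx) -> int:
        # Look up and invoke the callback registered for this model class.
        return self._max_image_tokens[model_cls](ctx)


REGISTRY = ToyMultiModalRegistry()


@REGISTRY.register_max_image_tokens(lambda ctx: 3744)
class ToyModel:
    pass
```

Because the decorator returns the class unmodified, stacking several such registrations (input mapper, dummy data, input processor) composes cleanly, as in the diff.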
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO the relationship between
register_max_tokens
andregister_dummy_data
is a bit intricate. There needs to be certain level of consistency here. Hard to get right. Should we mention something here?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I currently have a note in
registry_dummy_data
that mentions it should use the max number of tokens from each modality. Is that sufficient?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO the two should be tied together for consistency - see my comment below in
phi3v.py
.
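One way to address the reviewer's concern, sketched here as a hypothetical refactor rather than code from this PR, is to derive both callbacks from a single worst-case shape so the registered values cannot drift apart; the feature_size_fn parameter and helper names below are illustrative inventions:

```python
# Hypothetical refactor: one worst-case shape feeds both the max-token
# callback and the dummy-data builder, so the two registered values stay
# consistent by construction.
MAX_IMAGE_SHAPE = (8000, 50)  # h:w = 16:1, the worst case from the PR


def get_max_image_tokens(feature_size_fn) -> int:
    height, width = MAX_IMAGE_SHAPE
    return feature_size_fn(height, width)


def dummy_data(feature_size_fn, seq_len: int):
    height, width = MAX_IMAGE_SHAPE
    # The dummy sequence reserves exactly the max number of image tokens,
    # padding the remainder up to seq_len.
    num_image_tokens = get_max_image_tokens(feature_size_fn)
    token_ids = [0] * num_image_tokens + [0] * (seq_len - num_image_tokens)
    return token_ids, (height, width)
```

Here dummy_data reuses get_max_image_tokens directly, which is the "tied together" design the last comment argues for.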