Make pipeline able to load processor #32514

Merged
merged 37 commits into from
Oct 9, 2024
Changes from 26 commits
Commits
a3465ce
Refactor get_test_pipeline
qubvel Aug 7, 2024
e377b29
Fixup
qubvel Aug 7, 2024
500098d
Fixing tests
qubvel Aug 7, 2024
bab9a57
Add processor loading in tests
qubvel Aug 7, 2024
7fd209f
Restructure processors loading
qubvel Aug 7, 2024
3db0e0b
Add processor to the pipeline
qubvel Aug 6, 2024
71c2d5b
Move model loading on tom of the test
qubvel Aug 8, 2024
a010357
Update `get_test_pipeline`
qubvel Aug 8, 2024
c526e08
Fixup
qubvel Aug 8, 2024
d45d7f6
Add class-based flags for loading processors
qubvel Aug 8, 2024
a95d556
Change `is_pipeline_test_to_skip` signature
qubvel Aug 8, 2024
94f5616
Skip t5 failing test for slow tokenizer
qubvel Aug 8, 2024
2bd4e0e
Fixup
qubvel Aug 8, 2024
49ec283
Fix copies for T5
qubvel Aug 9, 2024
5686833
Fix typo
qubvel Aug 9, 2024
0a2349e
Add try/except for tokenizer loading (kosmos-2 case)
qubvel Aug 9, 2024
7d507a6
Fixup
qubvel Aug 9, 2024
3873c1c
Llama not fails for long generation
qubvel Aug 9, 2024
92d2be8
Revert processor pass in text-generation test
qubvel Aug 9, 2024
01d6040
Fix docs
qubvel Aug 9, 2024
ab7a229
Switch back to json file for image processors and feature extractors
qubvel Aug 22, 2024
6e0dde8
Add processor type check
qubvel Aug 22, 2024
4c834b3
Remove except for tokenizers
qubvel Aug 22, 2024
6a8e590
Fix docstring
qubvel Oct 1, 2024
0533995
Fix empty lists for tests
qubvel Oct 1, 2024
e06053d
Fixup
qubvel Oct 1, 2024
d553dab
Fix load check
qubvel Oct 2, 2024
6ff4e68
Ensure we have non-empty test cases
qubvel Oct 2, 2024
a6993b5
Update src/transformers/pipelines/__init__.py
qubvel Oct 3, 2024
5799775
Update src/transformers/pipelines/base.py
qubvel Oct 3, 2024
9793e01
Rework comment
qubvel Oct 3, 2024
e712717
Better docs, add note about pipeline components
qubvel Oct 4, 2024
bcce4dc
Change warning to error raise
qubvel Oct 7, 2024
64f002c
Fixup
qubvel Oct 7, 2024
cbb813f
Merge branch 'main' into add-processor-to-pipeline
qubvel Oct 7, 2024
6996695
Refine pipeline docs
qubvel Oct 7, 2024
b67d24f
Merge branch 'main' into add-processor-to-pipeline
qubvel Oct 9, 2024
71 changes: 69 additions & 2 deletions src/transformers/pipelines/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -28,7 +28,9 @@
from ..models.auto.feature_extraction_auto import FEATURE_EXTRACTOR_MAPPING, AutoFeatureExtractor
from ..models.auto.image_processing_auto import IMAGE_PROCESSOR_MAPPING, AutoImageProcessor
from ..models.auto.modeling_auto import AutoModelForDepthEstimation, AutoModelForImageToImage
from ..models.auto.processing_auto import PROCESSOR_MAPPING, AutoProcessor
from ..models.auto.tokenization_auto import TOKENIZER_MAPPING, AutoTokenizer
from ..processing_utils import ProcessorMixin
from ..tokenization_utils import PreTrainedTokenizer
from ..utils import (
CONFIG_NAME,
Expand Down Expand Up @@ -556,6 +558,7 @@ def pipeline(
tokenizer: Optional[Union[str, PreTrainedTokenizer, "PreTrainedTokenizerFast"]] = None,
feature_extractor: Optional[Union[str, PreTrainedFeatureExtractor]] = None,
image_processor: Optional[Union[str, BaseImageProcessor]] = None,
processor: Optional[Union[str, ProcessorMixin]] = None,
framework: Optional[str] = None,
revision: Optional[str] = None,
use_fast: bool = True,
Expand Down Expand Up @@ -644,6 +647,25 @@ def pipeline(
`model` is not specified or not a string, then the default feature extractor for `config` is loaded (if it
is a string). However, if `config` is also not given or not a string, then the default feature extractor
for the given `task` will be loaded.
image_processor (`str` or [`BaseImageProcessor`], *optional*):
The image processor that will be used by the pipeline to preprocess images for the model. This can be a
model identifier or an actual image processor inheriting from [`BaseImageProcessor`].

Image processors are used for Vision models and multi-modal models that require image inputs. Multi-modal
models will also require a tokenizer to be passed.

If not provided, the default image processor for the given `model` will be loaded (if it is a string). If
`model` is not specified or not a string, then the default image processor for `config` is loaded (if it is
a string).
processor (`str` or [`ProcessorMixin`], *optional*):
The processor that will be used by the pipeline to preprocess data for the model. This can be a model
identifier or an actual processor inheriting from [`ProcessorMixin`].

Processors are used for multi-modal models that require multi-modal inputs, for example, a model that
requires both text and image inputs.

If not provided, the default processor for the given `model` will be loaded (if it is a string). If `model`
is not specified or not a string, then the default processor for `config` is loaded (if it is a string).
Comment on lines +659 to +676
Member

There are now a few overlapping inputs:

  • tokenizer
  • feature extractor
  • image processor
  • processor

I believe it would be nice to highlight somewhere visible (like in the documentation above) what attribute is necessary for what: at no point should a user specify all four of them, for example.

Member Author
@qubvel Oct 7, 2024

I added a separate Note section (e712717) to highlight that we should not provide all types of processors at once, and to refer to a specific pipeline in case one would like to provide them explicitly.

For each specific pipeline, the docs section lists only the required processor arguments, configured through the docs decorator, e.g.:

build_pipeline_init_args(has_image_processor=True),

Member Author
@qubvel Oct 7, 2024

Also updated the pipeline doc to a more relevant one in 6996695.

framework (`str`, *optional*):
The framework to use, either `"pt"` for PyTorch or `"tf"` for TensorFlow. The specified framework must be
installed.
Expand Down Expand Up @@ -905,13 +927,29 @@ def pipeline(

model_config = model.config
hub_kwargs["_commit_hash"] = model.config._commit_hash

load_tokenizer = (
type(model_config) in TOKENIZER_MAPPING
or model_config.tokenizer_class is not None
or isinstance(tokenizer, str)
)
load_feature_extractor = type(model_config) in FEATURE_EXTRACTOR_MAPPING or feature_extractor is not None
load_image_processor = type(model_config) in IMAGE_PROCESSOR_MAPPING or image_processor is not None
load_feature_extractor = (
type(model_config) in FEATURE_EXTRACTOR_MAPPING
or feature_extractor is not None
or isinstance(feature_extractor, str)
Member

I think this conditional is redundant - if feature_extractor is a str then it will always be not None as well.

Member Author

addressed in d553dab

)
load_image_processor = (
type(model_config) in IMAGE_PROCESSOR_MAPPING
or image_processor is not None
or isinstance(image_processor, str)
)
load_processor = type(model_config) in PROCESSOR_MAPPING or processor is not None or isinstance(processor, str)

# Check that the pipeline class requires loading
load_tokenizer = load_tokenizer and pipeline_class._load_tokenizer
load_feature_extractor = load_feature_extractor and pipeline_class._load_feature_extractor
load_image_processor = load_image_processor and pipeline_class._load_image_processor
load_processor = load_processor and pipeline_class._load_processor
Comment on lines +944 to +948
Member Author
@qubvel Aug 9, 2024

For backward compatibility, each Pipeline class controls which processors/tokenizers are loaded. For example, zero-shot object detection needs only the processor, so image_processor and tokenizer do not need to be loaded separately. Other legacy pipelines might load only tokenizer and image_processor, even if they have a processor class.

Member

Piggy-backing on the comment above, this is likely something we want to highlight very clearly in each pipeline's documentation
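
The two-stage decision in the hunk above can be sketched in isolation. This is a simplified illustration, not the code from `pipelines/__init__.py`: `should_load`, `redundant_check`, and `simplified_check` are hypothetical helper names, and the mapping lookup is reduced to a plain boolean. It also shows why the extra `isinstance(..., str)` clause flagged in review adds nothing once `is not None` is already checked.

```python
from typing import Any, Optional


def redundant_check(component: Optional[Any]) -> bool:
    # Shape of the condition as first written in the diff.
    return component is not None or isinstance(component, str)


def simplified_check(component: Optional[Any]) -> bool:
    # A str is never None, so the isinstance clause cannot change the result.
    return component is not None


def should_load(config_in_mapping: bool, component: Optional[Any], class_flag: bool) -> bool:
    # Stage 1: the config maps to a component type, or the caller supplied one.
    available = config_in_mapping or component is not None
    # Stage 2: the pipeline class flag (e.g. `_load_processor`) has the final say.
    return available and class_flag


# A mapped processor is still skipped when the pipeline class opts out.
print(should_load(config_in_mapping=True, component=None, class_flag=False))
```

Under these assumptions, a component is loaded only when both stages agree, which is what lets legacy pipelines ignore a processor that the model config technically supports.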


# If `model` (instance of `PretrainedModel` instead of `str`) is passed (and/or same for config), while
# `image_processor` or `feature_extractor` is `None`, the loading will fail. This happens particularly for some
Expand Down Expand Up @@ -1074,6 +1112,32 @@ def pipeline(
if not is_pyctcdecode_available():
logger.warning("Try to install `pyctcdecode`: `pip install pyctcdecode")

if load_processor:
# Try to infer processor from model or config name (if provided as str)
if processor is None:
if isinstance(model_name, str):
processor = model_name
elif isinstance(config, str):
processor = config
else:
# Impossible to guess what is the right processor here
raise Exception(
"Impossible to guess which processor to use. "
"Please provide a processor instance or a path/identifier "
"to a processor."
)

# Instantiate processor if needed
if isinstance(processor, (str, tuple)):
processor = AutoProcessor.from_pretrained(processor, _from_pipeline=task, **hub_kwargs, **model_kwargs)
if not isinstance(processor, ProcessorMixin):
warnings.warn(
f"Processor will be not loaded, because {processor} is not an instance of `ProcessorMixin`. "
f"Got type `{type(processor)}` instead.",
UserWarning,
)
Member

With transformers already having too many warnings, I'd be cautious about the ones we add.

What purpose does this warning serve? Is it sufficiently actionable? Does it concern users (the ones that will see it), or repo owners/creators that have not configured their processors/feature extractors correctly (that will likely not see this warning)?

Member Author

Thanks for the questions!
There was a discussion here. The main idea is that AutoProcessor can load not only processors but also basic processing classes, like tokenizer.

Indeed, a misconfiguration in the model-pipeline-processor setup could trigger this. Raising a warning and dropping the processor is sufficient only when the processor isn't needed in the pipeline at all.

However, with granular control, such a case shouldn't occur. Therefore, it seems more appropriate to replace the warning with an error to clearly indicate a misconfiguration. Otherwise, the error will happen later with a less clear message because the processor is None.

Member Author

Changed to raising an error in bcce4dc.

processor = None
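
The resolution flow discussed in this thread (infer an identifier, instantiate it, then fail loudly on a wrong type) might look roughly like the sketch below. It is self-contained and makes several assumptions: `ProcessorMixin` here is an empty stand-in for the real class, `resolve_processor` and its `loader` parameter are hypothetical (the real code calls `AutoProcessor.from_pretrained`), and the exception types `ValueError` and `TypeError` are illustrative choices, not necessarily what bcce4dc raises.

```python
from typing import Any, Callable, Optional


class ProcessorMixin:
    """Empty stand-in for the real base class (assumption for this sketch)."""


def resolve_processor(
    processor: Optional[Any],
    model_name: Optional[Any],
    config: Optional[Any],
    loader: Callable[[str], Any],
) -> ProcessorMixin:
    # Infer an identifier from the model or config name when none was given.
    if processor is None:
        if isinstance(model_name, str):
            processor = model_name
        elif isinstance(config, str):
            processor = config
        else:
            raise ValueError(
                "Impossible to guess which processor to use. "
                "Please provide a processor instance or a path/identifier to a processor."
            )

    # Instantiate from the identifier if needed.
    if isinstance(processor, str):
        processor = loader(processor)

    # The loader may hand back a tokenizer or similar; treat that as a
    # misconfiguration instead of silently dropping the processor.
    if not isinstance(processor, ProcessorMixin):
        raise TypeError(
            f"Expected an instance of `ProcessorMixin`, got `{type(processor).__name__}` instead."
        )
    return processor
```

Raising at this point keeps the failure close to its cause, which is the argument made above for replacing the warning.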

if task == "translation" and model.config.task_specific_params:
for key in model.config.task_specific_params:
if key.startswith("translation"):
Expand All @@ -1099,4 +1163,7 @@ def pipeline(
if device is not None:
kwargs["device"] = device

if processor is not None:
kwargs["processor"] = processor

return pipeline_class(model=model, framework=framework, task=task, **kwargs)
27 changes: 26 additions & 1 deletion src/transformers/pipelines/base.py
Expand Up @@ -34,6 +34,7 @@
from ..image_processing_utils import BaseImageProcessor
from ..modelcard import ModelCard
from ..models.auto.configuration_auto import AutoConfig
from ..processing_utils import ProcessorMixin
from ..tokenization_utils import PreTrainedTokenizer
from ..utils import (
ModelOutput,
Expand Down Expand Up @@ -716,6 +717,7 @@ def build_pipeline_init_args(
has_tokenizer: bool = False,
has_feature_extractor: bool = False,
has_image_processor: bool = False,
has_processor: bool = False,
supports_binary_output: bool = True,
) -> str:
docstring = r"""
Expand All @@ -738,6 +740,11 @@ def build_pipeline_init_args(
image_processor ([`BaseImageProcessor`]):
The image processor that will be used by the pipeline to encode data for the model. This object inherits from
[`BaseImageProcessor`]."""
if has_processor:
docstring += r"""
processor ([`ProcessorMixin`]):
The processor that will be used by the pipeline to encode data for the model. This object inherits from
[`ProcessorMixin`]."""
docstring += r"""
modelcard (`str` or [`ModelCard`], *optional*):
Model card attributed to the model for this pipeline.
Expand Down Expand Up @@ -774,7 +781,11 @@ def build_pipeline_init_args(


PIPELINE_INIT_ARGS = build_pipeline_init_args(
has_tokenizer=True, has_feature_extractor=True, has_image_processor=True, supports_binary_output=True
has_tokenizer=True,
has_feature_extractor=True,
has_image_processor=True,
has_processor=True,
supports_binary_output=True,
)
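
As a rough illustration of how flag-driven docstring assembly like `build_pipeline_init_args` works, the sketch below appends one argument entry per enabled flag. The function name `build_init_args_doc` and the entry texts are hypothetical and much shorter than the real docstrings.

```python
def build_init_args_doc(
    has_tokenizer: bool = False,
    has_image_processor: bool = False,
    has_processor: bool = False,
) -> str:
    # Always-present head of the argument list.
    doc = "Arguments:\n    model (PreTrainedModel): The model used by the pipeline."
    # Each flag contributes its own entry, so a pipeline documents only
    # the components it actually accepts.
    if has_tokenizer:
        doc += "\n    tokenizer (PreTrainedTokenizer): Encodes text for the model."
    if has_image_processor:
        doc += "\n    image_processor (BaseImageProcessor): Encodes images for the model."
    if has_processor:
        doc += "\n    processor (ProcessorMixin): Encodes multi-modal data for the model."
    # Always-present tail.
    doc += "\n    modelcard (str or ModelCard, optional): Model card for this pipeline."
    return doc


print(build_init_args_doc(has_processor=True))
```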


Expand Down Expand Up @@ -805,6 +816,18 @@ class Pipeline(_ScikitCompat, PushToHubMixin):
constructor argument. If set to `True`, the output will be stored in the pickle format.
"""

# Previously, pipelines supported only `tokenizer`, `feature_extractor`, and `image_processor`.
# As we start adding `processor`, we want to avoid loading the processor for pipelines that don't require it,
# because, for example, they use `image_processor` and `tokenizer` separately.
Member

I didn't get this comment 😁

Member Author
@qubvel Oct 7, 2024

Tried to make it clearer in 9793e01

# However, we want to enable it for new pipelines. Moreover, this allows us to granularly control loading components
# and avoid loading tokenizer/image_processor/feature_extractor twice: once as a separate object
# and once in the processor. The following flags are set this way for backward compatibility and might be overridden
# in a specific Pipeline class.
_load_processor = False
_load_image_processor = True
_load_feature_extractor = True
_load_tokenizer = True
Member Author

granular control for loading, see comment in the code

Member

I think this makes sense and I appreciate us being explicit. This repeats what is said above, but this should be extremely clear in the pipelines documentation if possible
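
To make the class-level flags concrete, here is a minimal sketch of how a subclass could override them. The base-class defaults match the diff; `LegacyImageTextPipeline`, `ProcessorOnlyPipeline`, and the `components_to_load` helper are hypothetical names introduced only for illustration.

```python
class Pipeline:
    # Defaults from the diff: legacy components on, processor off.
    _load_processor = False
    _load_image_processor = True
    _load_feature_extractor = True
    _load_tokenizer = True

    @classmethod
    def components_to_load(cls):
        # Hypothetical helper: names of the components this class asks for.
        flags = {
            "processor": cls._load_processor,
            "image_processor": cls._load_image_processor,
            "feature_extractor": cls._load_feature_extractor,
            "tokenizer": cls._load_tokenizer,
        }
        return sorted(name for name, enabled in flags.items() if enabled)


class LegacyImageTextPipeline(Pipeline):
    """Hypothetical legacy pipeline: inherits the defaults unchanged."""


class ProcessorOnlyPipeline(Pipeline):
    """Hypothetical new pipeline: the processor already bundles the rest."""

    _load_processor = True
    _load_image_processor = False
    _load_feature_extractor = False
    _load_tokenizer = False


print(LegacyImageTextPipeline.components_to_load())
print(ProcessorOnlyPipeline.components_to_load())
```

Because the flags are class attributes, existing pipelines keep their behavior with no code changes, and a new pipeline opts in by overriding them in its own class body.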


default_input_names = None

def __init__(
Expand All @@ -813,6 +836,7 @@ def __init__(
tokenizer: Optional[PreTrainedTokenizer] = None,
feature_extractor: Optional[PreTrainedFeatureExtractor] = None,
image_processor: Optional[BaseImageProcessor] = None,
processor: Optional[ProcessorMixin] = None,
modelcard: Optional[ModelCard] = None,
framework: Optional[str] = None,
task: str = "",
Expand All @@ -830,6 +854,7 @@ def __init__(
self.tokenizer = tokenizer
self.feature_extractor = feature_extractor
self.image_processor = image_processor
self.processor = processor
self.modelcard = modelcard
self.framework = framework

Expand Down
11 changes: 9 additions & 2 deletions tests/models/altclip/test_modeling_altclip.py
Expand Up @@ -436,9 +436,16 @@ class AltCLIPModelTest(ModelTesterMixin, PipelineTesterMixin, unittest.TestCase)

# TODO: Fix the failed tests when this model gets more usage
def is_pipeline_test_to_skip(
self, pipeline_test_casse_name, config_class, model_architecture, tokenizer_name, processor_name
self,
pipeline_test_case_name,
config_class,
model_architecture,
tokenizer_name,
image_processor_name,
feature_extractor_name,
processor_name,
):
if pipeline_test_casse_name == "FeatureExtractionPipelineTests":
if pipeline_test_case_name == "FeatureExtractionPipelineTests":
return True

return False
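
The widened `is_pipeline_test_to_skip` signature means the test harness has to pass each component name separately. The sketch below shows one way such a harness could call the hook. It is not the code from `PipelineTesterMixin`: `PipelineTesterSketch`, `cases_to_run`, and the fallback to `[None]` for empty name lists are assumptions made for illustration.

```python
import itertools


class PipelineTesterSketch:
    def is_pipeline_test_to_skip(
        self,
        pipeline_test_case_name,
        config_class,
        model_architecture,
        tokenizer_name,
        image_processor_name,
        feature_extractor_name,
        processor_name,
    ):
        # Default hook: run everything.
        return False

    def cases_to_run(
        self, pipeline_test_case_name, tokenizer_names, image_processor_names, feature_extractor_names, processor_names
    ):
        # Substitute [None] for empty lists so the product is never empty (assumption).
        name_lists = [
            names or [None]
            for names in (tokenizer_names, image_processor_names, feature_extractor_names, processor_names)
        ]
        cases = []
        for combo in itertools.product(*name_lists):
            # config_class and model_architecture are left as None in this sketch.
            if self.is_pipeline_test_to_skip(pipeline_test_case_name, None, None, *combo):
                continue
            cases.append(combo)
        return cases


class SlowTokenizerSkippingTest(PipelineTesterSketch):
    def is_pipeline_test_to_skip(
        self,
        pipeline_test_case_name,
        config_class,
        model_architecture,
        tokenizer_name,
        image_processor_name,
        feature_extractor_name,
        processor_name,
    ):
        # Same shape as the BigBirdPegasus override in this PR: skip QA tests for slow tokenizers.
        return (
            pipeline_test_case_name == "QAPipelineTests"
            and tokenizer_name is not None
            and not tokenizer_name.endswith("Fast")
        )
```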
Expand Down
Expand Up @@ -167,9 +167,16 @@ class ASTModelTest(ModelTesterMixin, PipelineTesterMixin, unittest.TestCase):

# TODO: Fix the failed tests when this model gets more usage
def is_pipeline_test_to_skip(
self, pipeline_test_casse_name, config_class, model_architecture, tokenizer_name, processor_name
self,
pipeline_test_case_name,
config_class,
model_architecture,
tokenizer_name,
image_processor_name,
feature_extractor_name,
processor_name,
):
if pipeline_test_casse_name == "AudioClassificationPipelineTests":
if pipeline_test_case_name == "AudioClassificationPipelineTests":
return True

return False
Expand Down
11 changes: 9 additions & 2 deletions tests/models/bigbird_pegasus/test_modeling_bigbird_pegasus.py
Expand Up @@ -276,9 +276,16 @@ class BigBirdPegasusModelTest(ModelTesterMixin, GenerationTesterMixin, PipelineT

# TODO: Fix the failed tests
def is_pipeline_test_to_skip(
self, pipeline_test_casse_name, config_class, model_architecture, tokenizer_name, processor_name
self,
pipeline_test_case_name,
config_class,
model_architecture,
tokenizer_name,
image_processor_name,
feature_extractor_name,
processor_name,
):
if pipeline_test_casse_name == "QAPipelineTests" and not tokenizer_name.endswith("Fast"):
if pipeline_test_case_name == "QAPipelineTests" and not tokenizer_name.endswith("Fast"):
return True

return False
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -236,9 +236,16 @@ class BlenderbotSmallModelTest(ModelTesterMixin, GenerationTesterMixin, Pipeline

# TODO: Fix the failed tests when this model gets more usage
def is_pipeline_test_to_skip(
self, pipeline_test_casse_name, config_class, model_architecture, tokenizer_name, processor_name
self,
pipeline_test_case_name,
config_class,
model_architecture,
tokenizer_name,
image_processor_name,
feature_extractor_name,
processor_name,
):
return pipeline_test_casse_name == "TextGenerationPipelineTests"
return pipeline_test_case_name == "TextGenerationPipelineTests"

def setUp(self):
self.model_tester = BlenderbotSmallModelTester(self)
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -321,9 +321,16 @@ class FlaxBlenderbotSmallModelTest(FlaxModelTesterMixin, unittest.TestCase, Flax
all_generative_model_classes = (FlaxBlenderbotSmallForConditionalGeneration,) if is_flax_available() else ()

def is_pipeline_test_to_skip(
self, pipeline_test_casse_name, config_class, model_architecture, tokenizer_name, processor_name
self,
pipeline_test_case_name,
config_class,
model_architecture,
tokenizer_name,
image_processor_name,
feature_extractor_name,
processor_name,
):
return pipeline_test_casse_name == "TextGenerationPipelineTests"
return pipeline_test_case_name == "TextGenerationPipelineTests"

def setUp(self):
self.model_tester = FlaxBlenderbotSmallModelTester(self)
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -198,9 +198,16 @@ class TFBlenderbotSmallModelTest(TFModelTesterMixin, PipelineTesterMixin, unitte
test_onnx = False

def is_pipeline_test_to_skip(
self, pipeline_test_casse_name, config_class, model_architecture, tokenizer_name, processor_name
self,
pipeline_test_case_name,
config_class,
model_architecture,
tokenizer_name,
image_processor_name,
feature_extractor_name,
processor_name,
):
return pipeline_test_casse_name == "TextGenerationPipelineTests"
return pipeline_test_case_name == "TextGenerationPipelineTests"

def setUp(self):
self.model_tester = TFBlenderbotSmallModelTester(self)
Expand Down
9 changes: 8 additions & 1 deletion tests/models/bros/test_modeling_bros.py
Expand Up @@ -295,7 +295,14 @@ class BrosModelTest(ModelTesterMixin, PipelineTesterMixin, unittest.TestCase):
# BROS requires `bbox` in the inputs which doesn't fit into the above 2 pipelines' input formats.
# see https://github.com/huggingface/transformers/pull/26294
def is_pipeline_test_to_skip(
self, pipeline_test_casse_name, config_class, model_architecture, tokenizer_name, processor_name
self,
pipeline_test_case_name,
config_class,
model_architecture,
tokenizer_name,
image_processor_name,
feature_extractor_name,
processor_name,
):
return True

Expand Down
9 changes: 8 additions & 1 deletion tests/models/cpm/test_tokenization_cpm.py
Expand Up @@ -22,7 +22,14 @@
class CpmTokenizationTest(unittest.TestCase):
# There is no `CpmModel`
def is_pipeline_test_to_skip(
self, pipeline_test_casse_name, config_class, model_architecture, tokenizer_name, processor_name
self,
pipeline_test_case_name,
config_class,
model_architecture,
tokenizer_name,
image_processor_name,
feature_extractor_name,
processor_name,
):
return True

Expand Down
11 changes: 9 additions & 2 deletions tests/models/ctrl/test_modeling_ctrl.py
Expand Up @@ -211,9 +211,16 @@ class CTRLModelTest(ModelTesterMixin, GenerationTesterMixin, PipelineTesterMixin

# TODO: Fix the failed tests
def is_pipeline_test_to_skip(
self, pipeline_test_casse_name, config_class, model_architecture, tokenizer_name, processor_name
self,
pipeline_test_case_name,
config_class,
model_architecture,
tokenizer_name,
image_processor_name,
feature_extractor_name,
processor_name,
):
if pipeline_test_casse_name == "ZeroShotClassificationPipelineTests":
if pipeline_test_case_name == "ZeroShotClassificationPipelineTests":
# Get `tokenizer does not have a padding token` error for both fast/slow tokenizers.
# `CTRLConfig` was never used in pipeline tests, either because of a missing checkpoint or because a tiny
# config could not be created.
Expand Down
11 changes: 9 additions & 2 deletions tests/models/ctrl/test_modeling_tf_ctrl.py
Expand Up @@ -189,9 +189,16 @@ class TFCTRLModelTest(TFModelTesterMixin, PipelineTesterMixin, unittest.TestCase

# TODO: Fix the failed tests
def is_pipeline_test_to_skip(
self, pipeline_test_casse_name, config_class, model_architecture, tokenizer_name, processor_name
self,
pipeline_test_case_name,
config_class,
model_architecture,
tokenizer_name,
image_processor_name,
feature_extractor_name,
processor_name,
):
if pipeline_test_casse_name == "ZeroShotClassificationPipelineTests":
if pipeline_test_case_name == "ZeroShotClassificationPipelineTests":
# Get `tokenizer does not have a padding token` error for both fast/slow tokenizers.
# `CTRLConfig` was never used in pipeline tests, either because of a missing checkpoint or because a tiny
# config could not be created.
Expand Down
9 changes: 8 additions & 1 deletion tests/models/falcon/test_modeling_falcon.py
Expand Up @@ -312,7 +312,14 @@ class FalconModelTest(ModelTesterMixin, GenerationTesterMixin, PipelineTesterMix

# TODO (ydshieh): Check this. See https://app.circleci.com/pipelines/github/huggingface/transformers/79245/workflows/9490ef58-79c2-410d-8f51-e3495156cf9c/jobs/1012146
def is_pipeline_test_to_skip(
self, pipeline_test_casse_name, config_class, model_architecture, tokenizer_name, processor_name
self,
pipeline_test_case_name,
config_class,
model_architecture,
tokenizer_name,
image_processor_name,
feature_extractor_name,
processor_name,
):
return True

Expand Down