Import structure & first three model refactors #31329

Merged
merged 10 commits into from
Sep 10, 2024

Member

@LysandreJik LysandreJik commented Jun 8, 2024

Simplifying the import structure

This is the first PR that aims to facilitate adding new models by reducing the overhead caused by the __init__.py contributions.

It introduces three new concepts:

  • The export decorator, which accepts a tuple of backends as input.
  • The define_import_structure function, which accepts a filepath as input.
  • The __all__ variable, for easy exports.

This approach adds a small overhead compared to what is currently done, as it needs to open files and run string checks across them. The overhead is negligible compared to importing a framework such as torch, but it should still be highlighted. The approach is extremely specific to transformers and to the fact that we want to reduce the barrier to contribution to a minimum.

@export

The export decorator acts as an exporter. For now this is limited to the src/transformers/models path only, but if accepted, it can be propagated to the entire repository.

Using the @export decorator on a class or function exports it so that third parties can import it. The decorator accepts a tuple of backends. If a backend isn't installed, the object can still be imported, but using any method or attribute on it will raise an error.

This approach aims to deprecate and replace the use of dummy_xxx modules and objects.
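
To make the idea concrete, here is a minimal sketch of a decorator that records backends on the decorated object. It is not the code from this PR: the `_backends` attribute and the `missing_backends` helper are hypothetical, and the sketch assumes a backend name is also an importable module name, which is not true for every backend.

```python
import importlib.util


def export(*, backends=()):
    """Hypothetical marker decorator: record the backends an object needs."""

    def decorator(obj):
        obj._backends = tuple(backends)  # assumption: stored as a plain attribute
        return obj

    return decorator


def missing_backends(obj):
    """Return the recorded backends that cannot be found in this environment."""
    recorded = getattr(obj, "_backends", ())
    return tuple(b for b in recorded if importlib.util.find_spec(b) is None)


@export(backends=("a_backend_that_is_not_installed",))
class NeedsBackend:
    pass


@export(backends=("json",))  # a stdlib module stands in for an installed backend
class HasBackend:
    pass
```

Both classes are importable whether or not their backends exist; an import layer could call something like `missing_backends` to decide whether to hand out the real object or an error-raising stand-in.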

define_import_structure

This function relies on all the objects in a module being correctly marked with @export (previously @register) and with the correct backends. It replaces the current complex dict structure dedicated to lazy loading, and continues to support lazy loading.

The __all__ keyword

This variable is used here mainly for static type checkers, which cannot understand the @export decorator. Exporting objects through __all__ enables correct type hinting and type checking.

The define_import_structure function works in conjunction with the __all__ variable and the @export decorator to identify which imports depend on which backends, so that the lazy import scheme reflects those dependencies.
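
For readers unfamiliar with lazy import schemes, the following generic sketch shows the underlying idea: a module object that imports the real implementation only when a name is first accessed. `LazyModuleSketch` and its `name_to_module` mapping are hypothetical and much simpler than the lazy module class used in transformers.

```python
import importlib
import types


class LazyModuleSketch(types.ModuleType):
    """Resolve exported names on first attribute access instead of at import time."""

    def __init__(self, name, name_to_module):
        super().__init__(name)
        self._name_to_module = name_to_module  # e.g. {"sqrt": "math"}

    def __getattr__(self, attr):
        # Only called when normal lookup fails, i.e. the name is not loaded yet.
        if attr.startswith("_") or attr not in self._name_to_module:
            raise AttributeError(f"module {self.__name__!r} has no attribute {attr!r}")
        module = importlib.import_module(self._name_to_module[attr])
        value = getattr(module, attr)
        setattr(self, attr, value)  # cache so later lookups skip __getattr__
        return value
```

With `lazy = LazyModuleSketch("demo", {"sqrt": "math"})`, nothing is imported until `lazy.sqrt` is accessed; after that, the value is cached on the module object.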

The rest of the model refactor is here: #31330

@LysandreJik
Member Author

I'd be interested in having your review when you find the time @amyeroberts @ArthurZucker @Wauplin.

Member Author

This is mostly cleanup removing items that should not have been here in the first place

Collaborator

@ArthurZucker ArthurZucker left a comment

super nice that you added some tests as well!
🔥 Mostly, let's try to remove the __all__ by using the _TFBertLayer notation instead, and have one dirty file with legacy imports

# "sentencepiece", "tf"
# )
# )
elif "backends" in lines[previous_index + 1]:
Collaborator

I think in this specific case, parsing the super small code would be simpler with AST, WDYT?
Not parsing the entire module, but just a few lines, might be fast enough!

Collaborator

or regex might be a bit simpler ?

Member Author

I haven't done this yet. If we see this doesn't affect performance I'm happy to go with that in a future PR.
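
A sketch of the AST idea raised in this thread, parsing only the decorator text rather than the whole module. `backends_from_decorator` is a hypothetical helper written for this example and is not part of the PR.

```python
import ast


def backends_from_decorator(snippet):
    """Return the backends passed to a decorator call such as `@register(backends=(...))`."""
    # Dropping the leading "@" turns the decorator into a plain call expression.
    call = ast.parse(snippet.strip().lstrip("@"), mode="eval").body
    if not isinstance(call, ast.Call):
        return ()  # a bare decorator without arguments carries no backends
    for keyword in call.keywords:
        if keyword.arg == "backends":
            return tuple(ast.literal_eval(keyword.value))
    return ()
```

Because the parser handles parentheses and line breaks itself, the same helper covers both the single-line form and the form where the backends are spread over several lines.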

Collaborator

@amyeroberts amyeroberts left a comment

🔥 🔥 🔥 🔥 🔥 Can't wait to see so much duplicated code cut!

Main comment is that the @register pattern is quite repetitive, especially as we know e.g. all the pytorch layers will require torch.

Some of the logic I think could be simplified with some regex ✨magic✨

Returns the content of the __all__ variable in the file content.
Returns None if not defined, otherwise returns a list of strings.
"""
lines = file_content.split("\n")
Collaborator

This seems like something we could extract directly with a regex

Member Author

Indeed! I haven't done that yet but agree it could be done if it doesn't affect performance. Probably in a future PR.
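
One way the regex suggestion could look, as a sketch only. `ALL_PATTERN` and `fetch_all` are hypothetical names, and the pattern assumes `__all__` is assigned a list literal at the top level whose entries contain no `]` character.

```python
import ast
import re

# Capture the list literal assigned to a top-level __all__, across line breaks.
ALL_PATTERN = re.compile(r"^__all__\s*=\s*(\[.*?\])", re.MULTILINE | re.DOTALL)


def fetch_all(file_content):
    """Return the list assigned to `__all__`, or None if the file does not define it."""
    match = ALL_PATTERN.search(file_content)
    if match is None:
        return None
    # literal_eval only evaluates literals, so no code from the file is executed.
    return list(ast.literal_eval(match.group(1)))
```

This keeps the contract described in the docstring under review: None when `__all__` is absent, otherwise a list of strings.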

Comment on lines 1708 to 1719
# This allows registering items with other decorators. We'll take a look
# at the line that follows at the same indentation level.
if line.startswith((" ", "\t", "@", ")")) and not line.startswith("@register"):
    continue
Collaborator

We should be able to use a regex to capture just the registered backends which would mean we don't need to try to handle other decorators

Member Author

Agreed!

# Backends are defined on the same line as register
if "backends" in previous_line:
    backends_string = previous_line.split("backends=")[1].split("(")[1].split(")")[0]
    backends = tuple(sorted([b.strip("'\",") for b in backends_string.split(", ")]))
Collaborator

What's the reason for sorting them here? Especially if they're just going to be passed to the frozenset on L1762
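
The point behind this question can be checked in isolation: a frozenset ignores the order of the tuple it is built from. The snippet below is a standalone illustration with made-up dictionary contents; it does not show whether the sorted tuple is relied on elsewhere in the PR.

```python
sorted_backends = tuple(sorted(["tf", "sentencepiece"]))  # ("sentencepiece", "tf")
unsorted_backends = ("tf", "sentencepiece")

# Tuples are ordered, so these two compare unequal.
tuples_equal = sorted_backends == unsorted_backends

# Frozensets ignore order, so either tuple produces the same dictionary key.
import_structure = {frozenset(sorted_backends): {"tokenization_albert": {"AlbertTokenizer"}}}
lookup = import_structure[frozenset(unsorted_backends)]
```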

Contributor

@Wauplin Wauplin left a comment

Hey @LysandreJik, thanks for the ping. I reviewed it to the extent of my limited knowledge of transformers, so feel free to ignore some parts. I started my review before Amy's and ArthurZucker's reviews, so some comments might be outdated. And great job overall! Let's make sure to optimize the import logic to have as little overhead as possible :)

Comment on lines +1604 to +1717
class Placeholder(metaclass=DummyObject):
    _backends = missing_backends

    def __init__(self, *args, **kwargs):
        requires_backends(self, missing_backends)
Contributor

I find the logic a bit complex here:

  • we have a method requires_backends that takes as input an obj and a list of backends
    • the obj is forwarded only to retrieve the name of the class/of the object (L1473)
  • we have a DummyObject metaclass that defines __getattribute__ => fails when accessing any public property/method
  • we have a Placeholder object that uses the metaclass and defines a dummy __init__ method that also requires backends.
    • Placeholder.__name__ and Placeholder.__module__ are then set afterward.

I feel that this is a bit clunky. The ImportError logic is set twice (once in the metaclass, once in the class). Some class attributes are set in the class definition (_backends) while others are set afterwards (__name__/__module__). _backends is set only so that the metaclass sees it, etc.

Something you can do to dynamically generate classes is to still use a metaclass but use it as a callable. It's quite low-level Python but what you are doing here is low-level anyway. Something like this should work:

from typing import Any

class PlaceholderFactory(type):

    # 1. check that `_backends` is provided
    def __new__(cls, name: str, bases: Any, namespace: Any) -> Any:
        if "_backends" not in namespace:
            raise RuntimeError("`_backends` must be provided when generating a class with `PlaceholderFactory`.")
        return super().__new__(cls, name, bases, namespace)

    # 2. forbid instantiation: calling the class raises instead of reaching `__init__`
    def __call__(cls, *args, **kwargs) -> Any:
        cls._requires_backends()

    # 3. forbid any public class attribute; private and dunder names stay accessible
    def __getattribute__(cls, name: str) -> Any:
        if name.startswith("_"):
            return super().__getattribute__(name)
        cls._requires_backends()

    # 4. alias for `requires_backends` but all the logic can also be moved inside it
    def _requires_backends(cls):
        # pass the class itself: `requires_backends` reads the name from the object
        requires_backends(cls, cls._backends)

Once you have this, you can dynamically create the placeholder class like this:

value = PlaceholderFactory(name, (), {"__module__": self.__spec__, "_backends": missing_backends})

IMO it makes it clearer which part is handling what.
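
To try the behaviour outside of transformers, here is a condensed, self-contained variant of the same idea. The `requires_backends` function below is a stand-in written for this example that always raises, whereas the real helper only raises for backends that are actually unavailable; the class name and backends are placeholders.

```python
from typing import Any, Tuple


def requires_backends(obj: Any, backends: Tuple[str, ...]) -> None:
    """Stand-in for the transformers helper: in this sketch it always raises."""
    name = obj.__name__ if hasattr(obj, "__name__") else obj.__class__.__name__
    raise ImportError(f"{name} requires the following backends: {', '.join(backends)}")


class PlaceholderFactory(type):
    def __new__(mcs, name, bases, namespace):
        if "_backends" not in namespace:
            raise RuntimeError("`_backends` must be provided.")
        return super().__new__(mcs, name, bases, namespace)

    def __call__(cls, *args, **kwargs):
        # Instantiating the placeholder is an error.
        requires_backends(cls, cls._backends)

    def __getattribute__(cls, name):
        # Private and dunder names stay readable; public names are an error.
        if name.startswith("_"):
            return super().__getattribute__(name)
        requires_backends(cls, cls._backends)


# Build a placeholder class dynamically by calling the metaclass.
AlbertModel = PlaceholderFactory("AlbertModel", (), {"_backends": ("torch",)})
```

`AlbertModel.__name__` still works, while `AlbertModel()` and `AlbertModel.from_pretrained` both raise an ImportError naming the class and its backends.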

Member Author

I think this is a very valuable comment with which I entirely agree. I would be happy for us to get to that in a follow-up PR if that works with you.

Contributor

of course!

def spread_import_structure(nested_import_structure):
    def propagate_tuple(unordered_import_structure):
        tuple_first_import_structure = {}
        for _key, _value in unordered_import_structure.items():
Contributor

(nit) Instead of _key and _value I'd rather write more explicit names like module_name / module (?). Our future selves will thank us in 2 years :D

@LysandreJik LysandreJik force-pushed the simplify-contributions-model_import_structure branch from d805428 to f8527ee Compare July 5, 2024 09:56
@LysandreJik LysandreJik force-pushed the simplify-contributions-model_import_structure branch from f8527ee to 908dceb Compare July 25, 2024 12:10

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@ArthurZucker
Collaborator

Not stale

@LysandreJik LysandreJik force-pushed the simplify-contributions-model_import_structure branch from 5d76ab7 to 9b05311 Compare September 6, 2024 08:02
@LysandreJik LysandreJik force-pushed the simplify-contributions-model_import_structure branch from 3c57001 to 3967eaa Compare September 6, 2024 12:53
@LysandreJik
Member Author

Thanks all for your reviews! This should now be ready to merge!

Since the last review cycle, here are the changes that have been made:

__all__ as the default

The items written down in __all__ now do NOT need to be exported with export (previously register); they are naturally seen as exported.

Specific backends are assigned to them according to the file in which they are defined.

Namely, these are the assumptions:

BASE_FILE_REQUIREMENTS = {
    lambda e: 'modeling_tf_' in e: ('tf',),
    lambda e: 'modeling_flax_' in e: ('flax',),
    lambda e: 'modeling_' in e: ('torch',),
    lambda e: e.startswith('tokenization_') and e.endswith('_fast'): ('tokenizers',),
}

Therefore:

  • A file containing modeling_tf_ will add a tf requirement
  • A file containing modeling_flax_ will add a flax requirement
  • A file containing modeling_ that did not trigger one of the two above will add a torch requirement
  • A file starting with tokenization_ and ending with _fast (the .py is removed) will add a tokenizers requirement
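
The rules above could be applied with a small lookup like the following sketch. `default_backends` is a hypothetical helper, and the first-match-wins behaviour is an assumption drawn from the wording of the third bullet, not from the PR's code.

```python
BASE_FILE_REQUIREMENTS = {
    lambda e: "modeling_tf_" in e: ("tf",),
    lambda e: "modeling_flax_" in e: ("flax",),
    lambda e: "modeling_" in e: ("torch",),
    lambda e: e.startswith("tokenization_") and e.endswith("_fast"): ("tokenizers",),
}


def default_backends(filename):
    """Return the backends implied by a filename, or () when no rule matches."""
    # Remove the .py extension so that the `_fast` suffix check can apply.
    module_name = filename[: -len(".py")] if filename.endswith(".py") else filename
    # Dicts keep insertion order, so the more specific rules are tried first.
    for rule, backends in BASE_FILE_REQUIREMENTS.items():
        if rule(module_name):
            return backends
    return ()
```

Under these assumptions, `tokenization_albert.py` matches no rule and gets no default backend, so any object in it that needs one has to declare it explicitly.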

Register -> Export

register has been renamed to export. Given the above changes to __all__, the export keyword is now solely needed for exported objects that need additional or different backends than the ones given by the filename in which they reside.

An example in this PR is the AlbertTokenizer, which has a sentencepiece requirement but lives in tokenization_albert.py.

We cannot make assumptions according to the filename, so we manually export it with the right requirements:

@export(backends=("sentencepiece",))
class AlbertTokenizer(PreTrainedTokenizer):
    """

Apart from that, the code was cleaned up, optimized, and benchmarked. On the latest commits, the difference is not noticeable when importing from transformers: on my machine, each __init__ file adds an overhead of ~0.67 ms, and up to 1.5 ms for the very first instantiation.

On #31330, the total amount of time added for all modeling files stands between 150ms and 200ms.

The one comment I have left for later is yours, @Wauplin. I agree with it, but I don't have the bandwidth to get to it right now: #31329 (comment)

@LysandreJik
Member Author

There are some nice next steps following this:

  • This PR takes care of the rest of the models: Rest of model init refactors #31330
  • Once all models are handled, the export and __all__ keywords can be propagated through the root of the repo
  • Finally, the dummies can be removed and the main __init__ can be cleaned accordingly

@LysandreJik
Member Author

As seen with @ArthurZucker and @amyeroberts, merging with the first three models.

@LysandreJik LysandreJik merged commit f24f084 into main Sep 10, 2024
24 checks passed
@LysandreJik LysandreJik deleted the simplify-contributions-model_import_structure branch September 10, 2024 09:10
itazap pushed a commit to NielsRogge/transformers that referenced this pull request Sep 20, 2024
* Import structure & first three model refactors

* Register -> Export. Export all in __all__. Sensible defaults according to filename.

* Apply most comments from Amy and some comments from Lucain

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Co-authored-by: Lucain Pouget <lucainp@gmail.com>

* Style

* Add comment

* Clearer .py management

* Raise if not in backend mapping

* More specific type

* More efficient listdir

* Misc fixes

---------

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Co-authored-by: Lucain Pouget <lucainp@gmail.com>
amyeroberts added a commit to amyeroberts/transformers that referenced this pull request Oct 2, 2024
BernardZach pushed a commit to BernardZach/transformers that referenced this pull request Dec 5, 2024
BernardZach pushed a commit to innovationcore/transformers that referenced this pull request Dec 6, 2024