Fix dynamic module import error #21646

Merged
merged 14 commits into from
Feb 17, 2023

ydshieh
Collaborator

@ydshieh ydshieh commented Feb 15, 2023

What does this PR do?

Issue

We have a failing test

FAILED tests/models/auto/test_modeling_auto.py::AutoModelTest::test_from_pretrained_dynamic_model_distant

ModuleNotFoundError: No module named 'transformers_modules.local.modeling'

The full trace is given at the end.

After a long debugging process, it turns out that, when reloading from the saved model

model = AutoModel.from_pretrained("hf-internal-testing/test_dynamic_model", trust_remote_code=True)
with tempfile.TemporaryDirectory() as tmp_dir:
    model.save_pretrained(tmp_dir)
    reloaded_model = AutoModel.from_pretrained(tmp_dir, trust_remote_code=True)

if configuration.py appears in the dynamic module directory (here transformers_modules/local), it sometimes interferes with the import of transformers_modules.local.modeling. I don't have a clear explanation for this, however.
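
One mechanism that can produce an intermittent ModuleNotFoundError of this kind, though nothing in this PR confirms it is the cause here: Python's path-based finders cache directory listings, and the importlib documentation notes that a module created after the interpreter started may need importlib.invalidate_caches() before the import system notices it. A minimal sketch under that assumption (the package name demo_dynamic_pkg and the helper are hypothetical, not part of transformers):

```python
import importlib
import sys
import tempfile
from pathlib import Path


def import_freshly_written_module(package_dir: Path, module_name: str, source: str):
    """Write `module_name`.py into `package_dir` and import it right away."""
    (package_dir / f"{module_name}.py").write_text(source)
    # Finders cache directory listings; clearing the caches makes sure a file
    # written a moment ago is visible to the import system.
    importlib.invalidate_caches()
    return importlib.import_module(f"{package_dir.name}.{module_name}")


with tempfile.TemporaryDirectory() as root:
    pkg = Path(root) / "demo_dynamic_pkg"  # hypothetical package name
    pkg.mkdir()
    (pkg / "__init__.py").write_text("")
    sys.path.insert(0, root)
    try:
        # Import one module first, so the directory has already been listed
        # by the time the second module file is written.
        config = import_freshly_written_module(pkg, "configuration", "VALUE = 1\n")
        modeling = import_freshly_written_module(pkg, "modeling", "VALUE = 2\n")
        result = (config.VALUE, modeling.VALUE)
    finally:
        sys.path.remove(root)
```

The order mirrors the situation described above: a configuration module is imported first, then a modeling module is written into the same directory and imported.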

What this PR fixes

This PR therefore tries to avoid the appearance of other module files while the code imports a specific module file, around this line

def get_class_in_module():
    ...
    module = importlib.import_module(module_path)
    ...
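
For orientation, the pre-change body of this function, as quoted in the review diff further down, converts a file path into a dotted module path, imports it, and looks up the class. A minimal runnable sketch of that shape, using a standard-library module as a stand-in for a dynamic module (the stand-in is an assumption made only so the example runs anywhere):

```python
import importlib
import os


def get_class_in_module(class_name: str, module_path: str):
    """Import the module at `module_path` and return `class_name` from it."""
    # Turn a relative path such as "pkg/sub/modeling" into "pkg.sub.modeling".
    module_path = module_path.replace(os.path.sep, ".")
    module = importlib.import_module(module_path)
    return getattr(module, class_name)


# Stand-in for a dynamic module: a standard-library module addressed by path.
decoder_cls = get_class_in_module("JSONDecoder", os.path.join("json", "decoder"))
```

The import_module call is the line where the ModuleNotFoundError in the traceback is raised.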

Result

Running the reproduction snippet (provided in the comment below) in a loop 300 times:

  • with this PR: this issue doesn't appear (job run)

  • without the fix: this issue appears with 50% probability (job run)

Full traceback

Traceback (most recent call last):
    ...
    reloaded_model = AutoModel.from_pretrained(tmp_dir, trust_remote_code=True)
  File "/home/circleci/.pyenv/versions/3.7.12/lib/python3.7/site-packages/transformers/models/auto/auto_factory.py", line 463, in from_pretrained
    pretrained_model_name_or_path, module_file + ".py", class_name, **hub_kwargs, **kwargs
  File "/home/circleci/.pyenv/versions/3.7.12/lib/python3.7/site-packages/transformers/dynamic_module_utils.py", line 367, in get_class_from_dynamic_module
    return get_class_in_module(class_name, final_module.replace(".py", ""))
  File "/home/circleci/.pyenv/versions/3.7.12/lib/python3.7/site-packages/transformers/dynamic_module_utils.py", line 147, in get_class_in_module
    module = importlib.import_module(module_path)
  File "/home/circleci/.pyenv/versions/3.7.12/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 965, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'transformers_modules.local.modeling'

@ydshieh
Collaborator Author

ydshieh commented Feb 15, 2023

Run the following command

python run_debug.py

with the 2 files

run_debug.py

import os

for i in range(300):
    print(i)
    with open("output.txt", "a+") as fp:
        fp.write(str(i) + "\n")
    os.system("python3 debug.py")

(we need to run the debugging code foo (contained in the file debug.py) in a different process each time, instead of running debug.py with a for loop defined inside it, as that would always run in the same process)
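
The one-process-per-run idea can be sketched with subprocess instead of os.system. This is only an illustration, not the script used in this PR: the helper name is hypothetical, and sys.executable is used so the child runs under the same interpreter as the parent.

```python
import os
import subprocess
import sys


def run_in_fresh_processes(code: str, runs: int) -> list:
    """Run `code` in `runs` separate interpreters and collect their stdout."""
    outputs = []
    for _ in range(runs):
        # Each call starts a new interpreter with its own import state
        # (its own sys.modules), unlike a for loop inside one script.
        result = subprocess.run(
            [sys.executable, "-c", code], capture_output=True, text=True
        )
        outputs.append(result.stdout.strip())
    return outputs


child_pids = run_in_fresh_processes("import os; print(os.getpid())", runs=3)
parent_pid = str(os.getpid())
```

Every child reports a process id different from the parent's, which is the property the reproduction relies on.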

debug.py

import time, traceback, tempfile, os
from transformers.utils import HF_MODULES_CACHE


def foo():
    from transformers import AutoModel

    model = AutoModel.from_pretrained("hf-internal-testing/test_dynamic_model", trust_remote_code=True)
    # Test model can be reloaded.
    with tempfile.TemporaryDirectory() as tmp_dir:
        model.save_pretrained(tmp_dir)
        try:
            reloaded_model = AutoModel.from_pretrained(tmp_dir, trust_remote_code=True)
        except Exception as e:
            print(e)
            with open("output.txt", "a+") as fp:
                fp.write(f"{traceback.format_exc()}" + "\n")


if __name__ == "__main__":
    timeout = os.environ.get("PYTEST_TIMEOUT", 10)
    timeout = int(timeout)
    for i in range(1):
        time.sleep(1)
        print(i)
        with open("output.txt", "a+") as fp:
            fp.write(str(i) + "\n")
        try:
            os.system(f'rm -rf "{HF_MODULES_CACHE}"')
        except:
            pass
        foo()
        print("=" * 80)
        with open("output.txt", "a+") as fp:
            fp.write("=" * 80 + "\n")

@sgugger
Collaborator

sgugger commented Feb 15, 2023

Thanks for working on this! I was going to have a look at it when back from vacation but if you beat me to it ;-)

My solution would have been to change the way the local module works: for now I dump every file there without structure, I wanted to add a folder per model (so given by pretrained_model_name_or_path) which would also fix this issue I believe.

@ydshieh
Collaborator Author

ydshieh commented Feb 15, 2023

@sgugger I am open to exploring further, but I have some doubts regarding

I wanted to add a folder per model (so given by pretrained_model_name_or_path) which would also fix this issue I believe.

While I am debugging (this single test), the only module directories that appear are

transformers_modules/hf-internal-testing/test_dynamic_model/12345678901234567890.../
transformers_modules/local/

so I don't see multiple models sharing the same folder, but the issue still occurs. So, I am not sure how to proceed with the solution you mentioned above.

@ydshieh
Collaborator Author

ydshieh commented Feb 15, 2023

Hmm, this seems to affect other related tests. I will have to take a look 😭

@sgugger
Collaborator

sgugger commented Feb 15, 2023

I believe the conflict is between two files in local being written/deleted concurrently (but I might be wrong) hence making sure we have things like

transformers_modules/local/123456...
transformers_modules/local/777888...

might fix the issue.

@ydshieh
Collaborator Author

ydshieh commented Feb 15, 2023

I believe the conflict is between two files in local being written/deleted concurrently

On (circleci) CI, we have pytest -n 8, which might cause the situation you mentioned. But I am debugging by running the following function in a loop (and the issue still appears), so I kinda feel the issue doesn't come from concurrent read/write/delete operations

def foo():
    from transformers import AutoModel

    model = AutoModel.from_pretrained("hf-internal-testing/test_dynamic_model", trust_remote_code=True)
    # Test model can be reloaded.
    with tempfile.TemporaryDirectory() as tmp_dir:
        model.save_pretrained(tmp_dir)
        reloaded_model = AutoModel.from_pretrained(tmp_dir, trust_remote_code=True)

I could explore anyway - but maybe let me finalize the current PR (make CI green) first

@ydshieh ydshieh force-pushed the fix_dynamic_module_import_flaky_error branch from b678fba to e347d17 Compare February 15, 2023 20:50
@@ -212,7 +244,7 @@ def get_cached_module_file(
     # Download and cache module_file from the repo `pretrained_model_name_or_path` of grab it if it's a local file.
     pretrained_model_name_or_path = str(pretrained_model_name_or_path)
     if os.path.isdir(pretrained_model_name_or_path):
-        submodule = "local"
+        submodule = f"local_{pretrained_model_name_or_path.replace(os.path.sep, '_')}"
Collaborator Author

@ydshieh ydshieh Feb 15, 2023

@sgugger You already mentioned this in your comment. As I said, the issue doesn't seem to come from the concurrent file operations. However, the fix I implemented in this PR adds more operations on the module directory, and at some point it looks like it runs into a race condition (not 100% confident).

Therefore, I went ahead and made the module directory depend on pretrained_model_name_or_path, but I need to add replace(os.path.sep, '_') to handle the case where pretrained_model_name_or_path is something like /tmp/xxxyyy.

Collaborator

You can just taje the xxxyyy which should solve the issue for the tests (since they are all in tmp dirs that have unique names).

Collaborator Author

@ydshieh ydshieh Feb 16, 2023

@sgugger Sorry, but what is taje the xxxyyy?

Regarding they are all in tmp dirs that have unique names -> should solve the issue for the tests:
I guess what I did here also gives unique names (during testing), but without the (latest) changes in get_class_in_module, we still get the same issue, as I have already run it several times.

Collaborator Author

If you ever want to double check: run this code snippet

This test issue is really tricky to reproduce

Collaborator

You're the one who called your folder /tmp/xxxyyy in your first comment. I'm just saying you should take the last part, so pretrained_model_name_or_path.split(os.path.sep)[-1]
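
The two naming schemes discussed in this thread can be compared side by side. The function names below are hypothetical and exist only for the comparison; the expressions inside them are the ones quoted in the diff and in the comment above.

```python
import os


def submodule_from_full_path(path: str) -> str:
    # Scheme from the diff: flatten the whole path into a single name.
    return f"local_{path.replace(os.path.sep, '_')}"


def submodule_from_last_component(path: str) -> str:
    # Scheme suggested in review: keep only the last path component.
    return path.split(os.path.sep)[-1]


# Builds "/tmp/xxxyyy" on POSIX systems.
example = os.path.join(os.path.sep, "tmp", "xxxyyy")
flattened = submodule_from_full_path(example)
last_part = submodule_from_last_component(example)
```

One caveat of the second scheme as written: a path that ends with a separator would yield an empty last component.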

Collaborator Author

Done!

(My brain also has tmp memory regarding xxxyyy)

Comment on lines 154 to 159
# remove `configuration.py`: this is necessary when we try to import modeling module, or other tokenizer/processor
# modules, while configuration module has been imported previously.
# TODO: This is only a simple heuristic. In general, we might need to consider any dynamic module that has been
# imported. However, we don't have this information so far.
if os.path.isfile(f"{module_dir}/configuration.py"):
    os.remove(f"{module_dir}/configuration.py")
Collaborator

This is very weird and way too specific. Just because the tests call the file configuration doesn't mean it will always be called this way.

Collaborator Author

@ydshieh ydshieh Feb 16, 2023

We no longer need to deal with this specific file, but the same trick is required for the module file (the one we want to import).

# copy to a temporary directory
shutil.copy(f"{module_dir}/{module_file_name}", tmp_dir)
cmd = f'import os; os.remove("{module_dir}/{module_file_name}")'
os.system(f"python3 -c '{cmd}'")
Collaborator Author

No more test errors if we remove the file in a subprocess.

Collaborator

That's a bit crazy! Can you use the subprocess command instead of os.system? Not sure if this is going to fly well on Windows for instance.

Collaborator Author

Changed to subprocess. Tested on my Windows env. and it works.

shutil.copy(f"{module_dir}/{module_file_name}", tmp_dir)
cmd = f'import os; os.remove("{module_dir}/{module_file_name}")'
os.system(f"python3 -c '{cmd}'")
# os.remove(f"{module_dir}/{module_file_name}")
Collaborator Author

@ydshieh ydshieh Feb 16, 2023

os.remove(f"{module_dir}/{module_file_name}") is not working!!!!!!!

module_path = module_path.replace(os.path.sep, ".")
module = importlib.import_module(module_path)
return getattr(module, class_name)
with tempfile.TemporaryDirectory() as tmp_dir:
Collaborator Author

This is not the location the module is loaded from. It just holds the file temporarily, and the file will be copied back to its original place.

@ydshieh
Collaborator Author

ydshieh commented Feb 16, 2023

Finally got it:

  • we don't need to remove other files (config, __init__.py) or __pycache__ folder
  • the point is: we need to remove the module_file_name in a subprocess, then copy it back
    • os.system("rm -rf ...") works: as it is in another process
    • os.system(f"python3 -c '{cmd}'"): same, but we don't use Linux specific command --> way to go
    • os.remove(...): not working! I can't explain it (I don't know the underlying reason) 😢
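
The steps above can be sketched end to end as follows. This is a reconstruction from the snippets quoted in this thread, not the merged code: the function name and the demo package are hypothetical, sys.executable replaces the literal "python3", and repr() is used to quote the path.

```python
import importlib
import os
import shutil
import subprocess
import sys
import tempfile


def import_with_subprocess_refresh(module_dir, module_file_name, module_path):
    """Delete the module file in a subprocess, restore it, then import it."""
    module_file = os.path.join(module_dir, module_file_name)
    with tempfile.TemporaryDirectory() as tmp_dir:
        # The temporary directory only holds a copy; nothing is imported from it.
        shutil.copy(module_file, tmp_dir)
        # repr() yields a valid string literal, also for Windows backslashes.
        cmd = f"import os; os.remove({module_file!r})"
        subprocess.run([sys.executable, "-c", cmd])
        # Put the file back at its original location before importing.
        shutil.copyfile(os.path.join(tmp_dir, module_file_name), module_file)
    return importlib.import_module(module_path)


with tempfile.TemporaryDirectory() as root:
    pkg_dir = os.path.join(root, "demo_refresh_pkg")  # hypothetical package
    os.makedirs(pkg_dir)
    for name, source in (("__init__.py", ""), ("modeling.py", "ANSWER = 42\n")):
        with open(os.path.join(pkg_dir, name), "w") as fp:
            fp.write(source)
    sys.path.insert(0, root)
    try:
        module = import_with_subprocess_refresh(
            pkg_dir, "modeling.py", "demo_refresh_pkg.modeling"
        )
        answer = module.ANSWER
    finally:
        sys.path.remove(root)
```

The sketch only shows the sequence of operations; it does not reproduce the flaky failure, and it does not explain why deleting from another process behaves differently from os.remove in the same process.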

@ydshieh
Collaborator Author

ydshieh commented Feb 16, 2023

I don't know why we get an error where the missing module is not a Python file but a package. See below.
Can't reproduce so far, but the fix works for the auto model dynamic loading test.

FAILED tests/models/auto/test_image_processing_auto.py::AutoImageProcessorTest::test_from_pretrained_dynamic_image_processor

 - ModuleNotFoundError: No module named 'transformers_modules.local__tmp_tmpkcj_lb5j'

@ydshieh ydshieh changed the title [WIP] Fix dynamic module import error Fix dynamic module import error Feb 16, 2023
@ydshieh ydshieh marked this pull request as ready for review February 16, 2023 16:49
@ydshieh
Collaborator Author

ydshieh commented Feb 16, 2023

This PR is ready for review.

There is one failure that I can't reproduce with the same code snippet. See this comment. It seems to happen much more rarely, and we can probably investigate it if it happens again.

@ydshieh ydshieh requested a review from sgugger February 16, 2023 19:46
shutil.copy(f"{module_dir}/{module_file_name}", tmp_dir)
# On Windows, we need this character `r` before the path argument of `os.remove`
cmd = f'import os; os.remove(r"{module_dir}{os.path.sep}{module_file_name}")'
subprocess.run(["python", "-c", cmd])
Collaborator Author

@ydshieh ydshieh Feb 17, 2023

If something goes wrong in the subprocess.run, no error will be thrown (in the process that calls this method).
I think we should capture/check the output of subprocess.run, and do something:

  • either: not to call shutil.copyfile below (although this makes the test flaky in this logic branch)
  • or: throw an error manually with some information

Let me know if you have any suggestion :-)
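
For reference, the second option (raising manually with some information) could look roughly like the sketch below. It is an illustration only: the helper name is hypothetical, and the replies that follow conclude that this extra handling is not needed.

```python
import os
import subprocess
import sys
import tempfile


def remove_file_in_subprocess(path: str) -> None:
    """Remove `path` in a child interpreter and surface any failure."""
    cmd = f"import os; os.remove({path!r})"
    result = subprocess.run(
        [sys.executable, "-c", cmd], capture_output=True, text=True
    )
    # subprocess.run does not raise on a non-zero exit code by default,
    # so the return code has to be checked explicitly.
    if result.returncode != 0:
        raise RuntimeError(f"Could not remove {path!r}:\n{result.stderr}")


fd, demo_path = tempfile.mkstemp()
os.close(fd)
remove_file_in_subprocess(demo_path)
removed = not os.path.exists(demo_path)

try:
    remove_file_in_subprocess(demo_path)  # already gone, so the child fails
    second_call_raised = False
except RuntimeError:
    second_call_raised = True
```

Passing check=True to subprocess.run is a shorter alternative that raises CalledProcessError instead of a custom message.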

Collaborator

Why do we need to do something? If there is a problem deleting the file (which we copy just after), at worst we get the flaky failure again (though it should be extremely rare at this stage).

Collaborator Author

yeah, right!

@sgugger
Collaborator

sgugger commented Feb 17, 2023

Thanks for investigating this issue so deeply!

@ydshieh ydshieh merged commit 7f1cdf1 into main Feb 17, 2023
@ydshieh ydshieh deleted the fix_dynamic_module_import_flaky_error branch February 17, 2023 20:22
ArthurZucker pushed a commit to ArthurZucker/transformers that referenced this pull request Mar 2, 2023
* fix dynamic module import error

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>