-
Notifications
You must be signed in to change notification settings - Fork 27.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix dynamic module import error #21646
Conversation
The documentation is not available anymore as the PR was closed or merged. |
Run the following commpand python run_debug.py with the 2 files run_debug.pyimport os
for i in range(300):
print(i)
with open("output.txt", "a+") as fp:
fp.write(str(i) + "\n")
os.system("python3 debug.py") (we need to run the debugging code debug.pyimport time, traceback, tempfile, os
from transformers.utils import HF_MODULES_CACHE
def foo():
from transformers import AutoModel
model = AutoModel.from_pretrained("hf-internal-testing/test_dynamic_model", trust_remote_code=True)
# Test model can be reloaded.
with tempfile.TemporaryDirectory() as tmp_dir:
model.save_pretrained(tmp_dir)
try:
reloaded_model = AutoModel.from_pretrained(tmp_dir, trust_remote_code=True)
except Exception as e:
print(e)
with open("output.txt", "a+") as fp:
fp.write(f"{traceback.format_exc()}" + "\n")
if __name__ == "__main__":
timeout = os.environ.get("PYTEST_TIMEOUT", 10)
timeout = int(timeout)
for i in range(1):
time.sleep(1)
print(i)
with open("output.txt", "a+") as fp:
fp.write(str(i) + "\n")
try:
os.system(f'rm -rf "{HF_MODULES_CACHE}"')
except:
pass
foo()
print("=" * 80)
with open("output.txt", "a+") as fp:
fp.write("=" * 80 + "\n") |
Thanks for working on this! I was going to have a look at it when back from vacation but if you beat me to it ;-) My solution would have been to change the way the local module works: for now I dumb every file there without structure, I wanted to add a folder per model (so given by |
@sgugger I am open to explore further, but I have a bit doubt regarding
While I am debugging (this single test), the only model appears
so I don't see multiple models sharing the same folder, but the issue still occurs. So, I am not sure how to proceed with the solution you mentioned above. |
Hmm, there seems to affect other related tests. I will have to take a look 😭 |
I believe the conflict is between two files in local being written/deleted concurrently (but I might be wrong) hence making sure we things like
might fix the issue. |
On (circleci) CI, we have def foo():
from transformers import AutoModel
model = AutoModel.from_pretrained("hf-internal-testing/test_dynamic_model", trust_remote_code=True)
# Test model can be reloaded.
with tempfile.TemporaryDirectory() as tmp_dir:
model.save_pretrained(tmp_dir)
reloaded_model = AutoModel.from_pretrained(tmp_dir, trust_remote_code=True) I could explore anyway - but maybe let me finalize the current PR (make CI green) first |
b678fba
to
e347d17
Compare
@@ -212,7 +244,7 @@ def get_cached_module_file( | |||
# Download and cache module_file from the repo `pretrained_model_name_or_path` of grab it if it's a local file. | |||
pretrained_model_name_or_path = str(pretrained_model_name_or_path) | |||
if os.path.isdir(pretrained_model_name_or_path): | |||
submodule = "local" | |||
submodule = f"local_{pretrained_model_name_or_path.replace(os.path.sep, '_')}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sgugger You already mentioned this in your comment. As I said, the issue doesn't seem come from the concurrent file operations. However, the fix I implemented in this PR add more operations to the module directory, and at some point it looks getting some race condition (not 100% confident).
Therefore, I move forward to make the module directory depending on pretrained_model_name_or_path
, but I need to add replace(os.path.sep, '_')
to avoid the case where pretrained_model_name_or_path
being like /tmp/xxxyyy
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can just taje the xxxyyy which should solve the issue for the tests (since they are all in tmp dirs that have unique names).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sgugger Sorry, but what is taje the xxxyyy
?
Regarding they are all in tmp dirs that have unique names -> should solve the issue for the tests
:
I guess what I did here also gives the unique names (during testing), but without the (latest) changes in get_class_in_module
, we still get the same issue, as I already run it several times.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you ever want to double check: run this code snippet
This test issue is really tricky to reproduce
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're the one who called your folder /tmp/xxxyyy
in your first comment. I'm just saying you should take the last part, so pretrained_model_name_or_path.split(os.path.sep)[-1]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
(My brain also has tmp memory regarding xxxyyy)
# remove `configuration.py`: this is necessary when we try to import modeling module, or other tokenizer/processor | ||
# modules, while configuration module has been imported previously. | ||
# TODO: This is only a simple heuristic. In general, we might need to consider any dynamic module that has been | ||
# imported. However, we don't have this information so far. | ||
if os.path.isfile(f"{module_dir}/configuration.py"): | ||
os.remove(f"{module_dir}/configuration.py") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is very weird and way to specific. Just because the tests call the file configuration doesn't mean it will always be called this way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no longer need to deal with this specific file, but the same trick is required for the module file (that we want to import)
@@ -212,7 +244,7 @@ def get_cached_module_file( | |||
# Download and cache module_file from the repo `pretrained_model_name_or_path` of grab it if it's a local file. | |||
pretrained_model_name_or_path = str(pretrained_model_name_or_path) | |||
if os.path.isdir(pretrained_model_name_or_path): | |||
submodule = "local" | |||
submodule = f"local_{pretrained_model_name_or_path.replace(os.path.sep, '_')}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can just taje the xxxyyy which should solve the issue for the tests (since they are all in tmp dirs that have unique names).
# copy to a temporary directory | ||
shutil.copy(f"{module_dir}/{module_file_name}", tmp_dir) | ||
cmd = f'import os; os.remove("{module_dir}/{module_file_name}")' | ||
os.system(f"python3 -c '{cmd}'") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no more test error is we remove the file in a subprocess.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a bit crazy! Can you use the subprocess command instead of os.system
? Not sure if this is going to fly well on Windows for instance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to subprocess
. Tested on my Windows env. and it works.
shutil.copy(f"{module_dir}/{module_file_name}", tmp_dir) | ||
cmd = f'import os; os.remove("{module_dir}/{module_file_name}")' | ||
os.system(f"python3 -c '{cmd}'") | ||
# os.remove(f"{module_dir}/{module_file_name}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
os.remove(f"{module_dir}/{module_file_name}")
is not working!!!!!!!
module_path = module_path.replace(os.path.sep, ".") | ||
module = importlib.import_module(module_path) | ||
return getattr(module, class_name) | ||
with tempfile.TemporaryDirectory() as tmp_dir: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not the location to load the module. It's just to hold the file temporarily , and it will be copied back to the original place.
Finally get it:
|
Don't know why we get an error where a module is not a python file, but a package. See below. FAILED tests/models/auto/test_image_processing_auto.py::AutoImageProcessorTest::test_from_pretrained_dynamic_image_processor
- ModuleNotFoundError: No module named 'transformers_modules.local__tmp_tmpkcj_lb5j' |
This PR is ready for review. There is one failure thtat I can't reproduce with the same code snippet. See this comment. It seems this happens much rarely. And probably we can investigate it if it happens again. |
shutil.copy(f"{module_dir}/{module_file_name}", tmp_dir) | ||
# On Windows, we need this character `r` before the path argument of `os.remove` | ||
cmd = f'import os; os.remove(r"{module_dir}{os.path.sep}{module_file_name}")' | ||
subprocess.run(["python", "-c", cmd]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If something goes wrong in the subprocess.run
, no error will be thrown (in the process that calls this method).
I think we should capture/check the output of subprocess.run
, and do something:
- either: not to call
shutil.copyfile
below (although this makes the test flaky in this logic branch) - or: throw an error manually with some information
Let me know if you have any suggestion :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to do something? If there is a problem deleting the file (which we copy just after), at worst we get the flaky failure again (though it should be extremely rare at this stage).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, right!
Thanks for investigating so deeply this issue! |
* fix dynamic module import error --------- Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
What does this PR do?
Issue
We have failing test
FAILED tests/models/auto/test_modeling_auto.py::AutoModelTest::test_from_pretrained_dynamic_model_distant ModuleNotFoundError: No module named 'transformers_modules.local.modeling'
The full trace is given at the end.
After a long debug process, it turns out that, when reloading from the saved model
if
configuration.py
appears in the dynamic module directory (heretransformers_modules/local
), sometimes it interferes the import oftransformers_modules.local.modeling
. I have no clear reason for this situation however.What this PR fixes
This PR therefore tries to avoid the appearance of other module files while the code imports a specific module file, around this line
Result
Running the reproduce code snippet (provided in the comment below) in a loop of 300 times:
with this PR: this issue doesn't appear, job run
without the fix: this issue appears with 50% probability job run
Full traceback