Training executes the first epoch but then stops by itself. Why? #13
Can you show more details (the printed logs)?
Here are the output logs:
These are the last lines of the error log:

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group

The code doesn't crash; it just waits for something I can't figure out.
The logs already report "FileNotFoundError: [Errno 2] No such file or directory: 'java'". The problem is that Java is not installed.
Install it with "$ sudo apt-get update" and "$ sudo apt-get install openjdk-8-jdk", then run "$ java -version" to verify the installation.
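A quick way to confirm this diagnosis is to check, from Python, whether `java` can be found on `PATH` before training starts. This is a minimal sketch, assuming the FileNotFoundError comes from a subprocess that launches `java` by name; the helper `check_java` is hypothetical and not part of the project's train.py.

```python
import shutil
import subprocess
from typing import Optional


def check_java(executable: str = "java") -> Optional[str]:
    """Return the `java -version` banner, or None if `executable` is not on PATH.

    Hypothetical helper for illustration; not part of the project's train.py.
    """
    path = shutil.which(executable)
    if path is None:
        # This is the situation in which launching "java" from a subprocess
        # raises "FileNotFoundError: [Errno 2] No such file or directory: 'java'".
        return None
    # `java -version` writes its banner to stderr, not stdout.
    result = subprocess.run([path, "-version"], capture_output=True, text=True)
    return result.stderr.strip()


if __name__ == "__main__":
    banner = check_java()
    if banner is None:
        print("java not found on PATH; install a JDK (e.g. openjdk-8-jdk)")
    else:
        print(banner)
```

If this returns None inside the training environment while "$ java -version" works in your shell, the training process is most likely seeing a different PATH.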
I already have Java installed, but it keeps saying FileNotFoundError: [Errno 2] No such file or directory: 'java': 'java'
I suspect there is a problem with how your environment variables are set. Check your environment variables, or add the path to "java" in train.py.
Just like this:
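The screenshot for this reply is not preserved. As a minimal sketch of one way to do it, assuming the error comes from a child process that cannot find `java` on `PATH`, train.py could set `JAVA_HOME` and prepend its `bin` directory before anything launches Java; child processes inherit `os.environ`. The helper name `add_java_to_path` is hypothetical, and `/usr/lib/jvm/java-8-openjdk-amd64` is an assumed path (the usual location of Ubuntu's openjdk-8-jdk on amd64); substitute your own installation.

```python
import os


def add_java_to_path(java_home: str) -> None:
    """Set JAVA_HOME and put `<java_home>/bin` first on PATH for this process.

    Hypothetical helper; subprocesses started afterwards inherit os.environ,
    so they can then find `java`.
    """
    os.environ["JAVA_HOME"] = java_home
    java_bin = os.path.join(java_home, "bin")
    current = os.environ.get("PATH", "")
    entries = current.split(os.pathsep) if current else []
    # Put java_bin first, dropping any earlier copy so it appears only once.
    entries = [entry for entry in entries if entry != java_bin]
    os.environ["PATH"] = os.pathsep.join([java_bin] + entries)


# At the top of train.py, before anything spawns java (assumed Ubuntu path):
add_java_to_path("/usr/lib/jvm/java-8-openjdk-amd64")
```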
I have these entries in the jvm directory: java, java-openjdk, jre-1.8.0. Should I set 'java-1.8.0-openjdk-1.8.0.322.b06-1.el7_9.x86_64' as JAVA_HOME?
I don't know why you don't have the directory "java-8-openjdk-amd64" if Java was installed successfully. You could try your proposed JAVA_HOME first.
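To see which entries in the jvm directory are usable JAVA_HOME values, a minimal sketch, assuming the usual `/usr/lib/jvm` layout where a Java home contains `bin/java` (or only `jre/bin/java` for a JRE-only install), is to list the entries that contain an executable `java`. Names such as `jre-1.8.0` are often symlinks, so the sketch resolves each one with `os.path.realpath` and reports each installation once. The helper name `find_java_homes` is hypothetical.

```python
import os
from typing import List


def find_java_homes(jvm_dir: str = "/usr/lib/jvm") -> List[str]:
    """List directories under jvm_dir that could serve as JAVA_HOME.

    An entry qualifies if it has an executable bin/java; if only
    jre/bin/java exists, the jre subdirectory is reported instead.
    Symlinked aliases are resolved so each installation appears once.
    Hypothetical helper for illustration.
    """
    if not os.path.isdir(jvm_dir):
        return []
    homes = set()
    for name in os.listdir(jvm_dir):
        entry = os.path.join(jvm_dir, name)
        # Check the entry itself first, then a bundled JRE inside it.
        for home in (entry, os.path.join(entry, "jre")):
            java = os.path.join(home, "bin", "java")
            if os.path.isfile(java) and os.access(java, os.X_OK):
                homes.add(os.path.realpath(home))
                break
    return sorted(homes)


if __name__ == "__main__":
    for home in find_java_homes():
        print(home)
```

Any printed path is a candidate JAVA_HOME; an empty result means no Java installation was found in that directory.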