Multi-GPU training does not move on this interface #36
Comments
The program freezes here for a long time.
What about using gloo as the backend? The hang may be due to an NCCL error.
root@18dc3f8e2e1d:/workspace/wangs/DenseTNT# python src/run.py --argoverse --future_frame_num 30 --do_train --data_dir /workspace/datasets/Argoverse/train/data/ --output_dir models.densetnt.1 --hidden_size 128 --train_batch_size 64 --use_map --core_num 16 --use_centerline --distributed_training 8 --other_params semantic_lane direction l1_loss goals_2D enhance_global_graph subdivide goal_scoring laneGCN point_sub_graph lane_scoring complete_traj complete_traj-3
{'add_prefix': None, 'agent_type': None, 'argoverse': True, 'attention_decay': False, 'autoregression': None, 'core_num': 16, 'cuda_visible_device_num': None, 'data_dir': '/workspace/datasets/Argoverse/train/data/', 'data_dir_for_val': 'val/data/', 'debug': False, 'distributed_training': 8, 'do_eval': False, 'do_test': False, 'do_train': True, 'eval_batch_size': 64, 'eval_params': [], 'future_frame_num': 30, 'future_test_frame_num': 16, 'global_graph_depth': 1, 'gpu_split': 0, 'hidden_dropout_prob': 0.1, 'hidden_size': 128, 'initializer_range': 0.02, 'inter_agent_types': None, 'learning_rate': 0.001, 'log_dir': 'models.densetnt.1', 'lstm': False, 'master_port': '12355', 'max_distance': 50.0, 'method_span': [0, 1], 'mode_num': 6, 'model_recover_path': None, 'model_save_dir': 'models.densetnt.1/model_save', 'multi': None, 'nms_threshold': None, 'no_agents': False, 'no_cuda': False, 'no_sub_graph': False, 'not_use_api': False, 'num_train_epochs': 16.0, 'nuscenes': False, 'old_version': False, 'other_params': {'semantic_lane': True, 'direction': True, 'l1_loss': True, 'goals_2D': True, 'enhance_global_graph': True, 'subdivide': True, 'goal_scoring': True, 'laneGCN': True, 'point_sub_graph': True, 'lane_scoring': True, 'complete_traj': True, 'complete_traj-3': True}, 'output_dir': 'models.densetnt.1', 'placeholder': 0.0, 'reuse_temp_file': False, 'seed': 42, 'single_agent': True, 'stage_one_K': None, 'sub_graph_batch_size': 8000, 'sub_graph_depth': 3, 'temp_file_dir': 'models.densetnt.1/temp_file', 'train_batch_size': 64, 'train_extra': False, 'train_params': [], 'use_centerline': True, 'use_map': True, 'visualize': False, 'waymo': False, 'weight_decay': 0.01}
10/21/2022 01:57:04 - INFO - main - ***** args *****
output_dir models.densetnt.1
other_params ['semantic_lane', 'direction', 'l1_loss', 'goals_2D', 'enhance_global_graph', 'subdivide', 'goal_scoring', 'laneGCN', 'point_sub_graph', 'lane_scoring', 'complete_traj', 'complete_traj-3']
10/21/2022 01:57:11 - INFO - main - device: cuda
Loading dataset ['/workspace/datasets/Argoverse/train/data/']
/opt/conda/lib/python3.8/site-packages/scipy/__init__.py:138: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.23.4)
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion} is required for this version of "
10/21/2022 01:57:12 - INFO - argoverse.data_loading.vector_map_loader - Loaded root: ArgoverseVectorMap
Running DDP on rank 3.
Running DDP on rank 5.
Running DDP on rank 1.
Running DDP on rank 7.
Running DDP on rank 0.
Running DDP on rank 6.
Running DDP on rank 4.
Running DDP on rank 2.
10/21/2022 01:57:13 - INFO - torch.distributed.distributed_c10d - Added key: store_based_barrier_key:1 to store for rank: 6
10/21/2022 01:57:13 - INFO - torch.distributed.distributed_c10d - Added key: store_based_barrier_key:1 to store for rank: 4
10/21/2022 01:57:13 - INFO - torch.distributed.distributed_c10d - Added key: store_based_barrier_key:1 to store for rank: 2
10/21/2022 01:57:14 - INFO - torch.distributed.distributed_c10d - Added key: store_based_barrier_key:1 to store for rank: 3
10/21/2022 01:57:14 - INFO - argoverse.data_loading.vector_map_loader - Loaded root: ArgoverseVectorMap
10/21/2022 01:57:14 - INFO - torch.distributed.distributed_c10d - Added key: store_based_barrier_key:1 to store for rank: 5
10/21/2022 01:57:14 - INFO - torch.distributed.distributed_c10d - Added key: store_based_barrier_key:1 to store for rank: 1
10/21/2022 01:57:14 - INFO - torch.distributed.distributed_c10d - Added key: store_based_barrier_key:1 to store for rank: 0
10/21/2022 01:57:14 - INFO - torch.distributed.distributed_c10d - Added key: store_based_barrier_key:1 to store for rank: 7
10/21/2022 01:57:14 - INFO - torch.distributed.distributed_c10d - Rank 7: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
10/21/2022 01:57:14 - INFO - torch.distributed.distributed_c10d - Rank 3: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
10/21/2022 01:57:14 - INFO - torch.distributed.distributed_c10d - Rank 4: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
10/21/2022 01:57:14 - INFO - torch.distributed.distributed_c10d - Rank 5: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
10/21/2022 01:57:14 - INFO - torch.distributed.distributed_c10d - Rank 6: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
10/21/2022 01:57:14 - INFO - torch.distributed.distributed_c10d - Rank 2: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
10/21/2022 01:57:14 - INFO - torch.distributed.distributed_c10d - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
10/21/2022 01:57:14 - INFO - torch.distributed.distributed_c10d - Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
['/workspace/datasets/Argoverse/train/data/129892.csv', '/workspace/datasets/Argoverse/train/data/179439.csv', '/workspace/datasets/Argoverse/train/data/153379.csv', '/workspace/datasets/Argoverse/train/data/11971.csv', '/workspace/datasets/Argoverse/train/data/181683.csv'] ['/workspace/datasets/Argoverse/train/data/209097.csv', '/workspace/datasets/Argoverse/train/data/102649.csv', '/workspace/datasets/Argoverse/train/data/186077.csv', '/workspace/datasets/Argoverse/train/data/74459.csv', '/workspace/datasets/Argoverse/train/data/89887.csv']
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 205942/205942 [06:12<00:00, 552.14it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 205942/205942 [00:07<00:00, 27049.96it/s]
valid data size is 205942