I tried to run this model to evaluate dummy folders at the same time on a single GPU (an A100 with 80 GB).
Evaluating one folder works well, but when I additionally start a second run, an error appears.
It seems to be a problem with the torch.distributed package.
When I googled it, people said the problem can be solved by changing the port number.
Do you know how to change the port number in this code?
Error message:
None
Global Rank: 0 Local Rank: 0
Killing subprocess 659577
Traceback (most recent call last):
  File "train.py", line 299, in <module>
    torch.distributed.init_process_group(backend='nccl',
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 500, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 190, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: Address already in use
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/conda/bin/python', '-u', 'train.py', '--local_rank=0', '--dataset', 'cityscapes', '--cv', '0', '--syncbn', '--apex', '--fp16', '--bs_val', '1', '--eval', 'folder', '--eval_folder', '/workspace/lyft_trainval_images', '--dump_assets', '--dump_all_images', '--n_scales', '0.5,1.0,2.0', '--snapshot', 'large_asset_dir/seg_weights/cityscapes_ocrnet.HRNet_Mscale_outstanding-turtle.pth', '--arch', 'ocrnet.HRNet_Mscale', '--result_dir', 'logs/dump_folder/frisky-serval_2022.06.17_17.15']' returned non-zero exit status 1.
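For reference, a minimal sketch of how the port could be changed, assuming the script is started through torch.distributed.launch as the command in the traceback suggests. The launcher's --master_port flag overrides the default rendezvous port (29500), so two simultaneous runs on the same machine can each bind their own TCPStore; the value 29501 below is just an arbitrary free port, and "<usual arguments>" stands for the train.py flags already shown above:

# first run keeps the default rendezvous port (29500)
python -m torch.distributed.launch --nproc_per_node=1 train.py <usual arguments>

# second, simultaneous run on the same machine: give it a different port
python -m torch.distributed.launch --nproc_per_node=1 --master_port=29501 train.py <usual arguments>

Any free port works; the two runs only need different values so that the second TCPStore does not try to bind an address that is already in use.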