
GPU load balancing when running multiple jobs #7

Open
theorist17 opened this issue Apr 27, 2022 · 1 comment

theorist17 commented Apr 27, 2022

Hi, I am trying to run 4 jobs, each with 2 processes, on 4x 12GB GPUs (using docker_run.py --headless, built from Dockerfile_tiffany).
I assumed that each job would run on its own isolated GPU, so that memory usage would be roughly the same on each GPU.

But I find this is not the case; each GPU is using a different amount of memory.
It seems that whenever a new job is created, some amount of GPU memory is allocated on GPU #0, even though I specified a GPU other than 0 by setting:
--which_gpu 1 --sem_gpu_id 1 --sem_seg_gpu 1 --depth_gpu 1

Why is this happening? How can I balance the GPU memory load for better utilization?
(Maybe some part of the code runs on the default GPU (GPU #0)? The traceback below shows segmentation_helper.py loading its MaskRCNN with a hard-coded torch.device("cuda:0" if args.cuda else "cpu"), which would explain allocations on GPU #0 regardless of the flags.)
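A workaround I am considering (a sketch only; CUDA_VISIBLE_DEVICES is standard CUDA process-level isolation, but I have not verified how it interacts with this repo's GPU flags) is to hide all but one physical GPU from each job, so that anything pinned to cuda:0 still lands on the intended device. Note that the THOR/Unity renderer is placed by the X display, not by CUDA_VISIBLE_DEVICES:

# Sketch: expose only physical GPU 1 to this job; inside the process it
# appears as cuda:0, so every GPU flag can stay at index 0.
export DISPLAY=:1
CUDA_VISIBLE_DEVICES=1 python main.py \
-n1 --max_episode_length 1000 --num_local_steps 25 \
--num_processes 2 --eval_split tests_unseen \
--x_display 1 \
--which_gpu 0 --sem_gpu_id 0 --sem_seg_gpu 0 --depth_gpu 0 \
--use_sem_policy --save_pictures &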

My bash script: run_tests_unseen.sh

#!/bin/bash

# total samples in tests_unseen: 1529

# trap ctrl-c and call ctrl_c()
trap ctrl_c INT
function ctrl_c() {
    echo "** Trapped CTRL-C"

    END=$(date +%s.%N)
    ENDTIME=$(date)
    DIFF=$(echo "$END - $START" | bc)
    echo "Start $START ($STARTTIME)"
    echo "End $END ($ENDTIME)"
    echo "Diff $DIFF seconds"
}

# clean up
function kill_descendant_processes() {
    echo "** Trapped kill_descendant_processes $1"
    local pid="$1"
    local and_self="${2:-false}"
    if children="$(pgrep -P "$pid")"; then
        for child in $children; do
            echo "   parent $pid has child $child"
            kill_descendant_processes "$child" true
        done
    fi
    if [[ "$and_self" == true ]]; then
        echo "   killing $pid"
        kill -9 "$pid"
    fi
}
trap "kill_descendant_processes $$" EXIT
# trap "trap - SIGTERM && kill -- -$$" SIGINT SIGTERM EXIT
# trap "killall background" EXIT
# trap "pkill -P $$" EXIT
# trap "pkill -9 python && pkill -9 thor && pkill -9 Xorg" EXIT

# remove existing pictures
rm -rf pictures/tests_unseen/first_run_0
rm -rf pictures/tests_unseen/first_run_1
rm -rf pictures/tests_unseen/first_run_2
rm -rf pictures/tests_unseen/first_run_3

# measure time & count failures
declare -a PIDS=()
FAIL=0
START=$(date +%s.%N)
STARTTIME=$(date)

# run processes in the background & collect PIDs for wait

# gpu 0

export DISPLAY=:0 && python main.py \
-n1 --max_episode_length 1000 --num_local_steps 25 \
--num_processes 2 --eval_split tests_unseen \
--from_idx 0 --to_idx 2 \
--x_display 0  \
--max_fails 10  \
--debug_local --learned_depth  --use_sem_seg \
--which_gpu 0 --sem_gpu_id 0 --sem_seg_gpu 0 --depth_gpu 0 \
--set_dn first_run_0 \
--use_sem_policy  --save_pictures &
echo "Running PID: $!"
PIDS+=($!)
echo "Waiting PIDS: ${PIDS[*]}"
sleep 20

# gpu 1

export DISPLAY=:1 && python main.py \
-n1 --max_episode_length 1000 --num_local_steps 25 \
--num_processes 2 --eval_split tests_unseen \
--from_idx 2 --to_idx 4 \
--x_display 1  \
--max_fails 10  \
--debug_local --learned_depth  --use_sem_seg \
--which_gpu 1 --sem_gpu_id 1 --sem_seg_gpu 1 --depth_gpu 1 \
--set_dn first_run_1 \
--use_sem_policy  --save_pictures &
echo "Running PID: $!"
PIDS+=($!)
echo "Waiting PIDS: ${PIDS[*]}"
sleep 20

# gpu 2

export DISPLAY=:2 && python main.py \
-n1 --max_episode_length 1000 --num_local_steps 25 \
--num_processes 2 --eval_split tests_unseen \
--from_idx 4 --to_idx 6 \
--x_display 2  \
--max_fails 10  \
--debug_local --learned_depth  --use_sem_seg \
--which_gpu 2 --sem_gpu_id 3 --sem_seg_gpu 3 --depth_gpu 2 \
--set_dn first_run_2 \
--use_sem_policy  --save_pictures &
echo "Running PID: $!"
PIDS+=($!)
echo "Waiting PIDS: ${PIDS[*]}"
sleep 20


# gpu 3

export DISPLAY=:3 && python main.py \
-n1 --max_episode_length 1000 --num_local_steps 25 \
--num_processes 2 --eval_split tests_unseen \
--from_idx 6 --to_idx 8 \
--x_display 3  \
--max_fails 10  \
--debug_local --learned_depth  --use_sem_seg \
--which_gpu 2 --sem_gpu_id 3 --sem_seg_gpu 3 --depth_gpu 2 \
--set_dn first_run_3 \
--use_sem_policy  --save_pictures &
echo "Running PID: $!"
PIDS+=($!)
echo "Waiting PIDS: ${PIDS[*]}"
sleep 20

# wait for all "python main.py" processes
for PID in "${PIDS[@]}"; do
    wait "$PID" || let "FAIL+=1"
done

# report failures
if [ "$FAIL" == "0" ];
then
echo "ALL PROCESS SUCCESSFULLY FINISHED!"
else
echo "FAIL! ($FAIL)"
fi

# report time
END=$(date +%s.%N)
ENDTIME=$(date)
DIFF=$(echo "$END - $START" | bc)
echo "Start $START ($STARTTIME)"
echo "End $END ($ENDTIME)"
echo "Diff $DIFF seconds"

My startx script: run_xserver.sh

#!/bin/bash

trap "kill_descendant_processes $$" EXIT
function kill_descendant_processes() {
    echo "** Trapped kill_descendant_processes $1"
    local pid="$1"
    local and_self="${2:-false}"
    if children="$(pgrep -P "$pid")"; then
        for child in $children; do
            echo "   parent $pid has child $child"
            kill_descendant_processes "$child" true
        done
    fi
    if [[ "$and_self" == true ]]; then
        echo "   killing $pid"
        kill -9 "$pid"
    fi
}

declare -a PIDS=()

python alfred_utils/scripts/startx.py 0 &
sleep 2
echo "Running PID: $!"
PIDS+=($!)
echo "Waiting PIDS: ${PIDS[*]}"

python alfred_utils/scripts/startx.py 1 &
sleep 2
echo "Running PID: $!"
PIDS+=($!)
echo "Waiting PIDS: ${PIDS[*]}"

python alfred_utils/scripts/startx.py 2 &
sleep 2
echo "Running PID: $!"
PIDS+=($!)
echo "Waiting PIDS: ${PIDS[*]}"

python alfred_utils/scripts/startx.py 3 &
sleep 2
echo "Running PID: $!"
PIDS+=($!)
echo "Waiting PIDS: ${PIDS[*]}"

wait
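To sanity-check that every X server actually came up before launching jobs, something like this should work (assuming xdpyinfo is installed in the container):

# Probe each display; xdpyinfo fails if the X server is not reachable
for d in 0 1 2 3; do
    DISPLAY=:$d xdpyinfo > /dev/null 2>&1 && echo "display :$d OK" || echo "display :$d DOWN"
done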

When a job is assigned only to GPU 0

Every 2.0s: nvidia-smi -lms 100                                                                                                dualarm-server: Wed Apr 27 22:35:00 2022

Wed Apr 27 22:35:00 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 465.19.01    CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA TITAN Xp     Off  | 00000000:02:00.0 Off |                  N/A |
| 36%   63C    P2    82W / 250W |   7480MiB / 12194MiB |      8%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA TITAN Xp     Off  | 00000000:03:00.0 Off |                  N/A |
| 31%   44C    P8    10W / 250W |     13MiB / 12196MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA TITAN Xp     Off  | 00000000:82:00.0 Off |                  N/A |
| 23%   33C    P8     9W / 250W |     13MiB / 12196MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA TITAN Xp     Off  | 00000000:83:00.0 Off |                  N/A |
| 23%   39C    P8    10W / 250W |     13MiB / 12196MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     16264      G   /usr/lib/xorg/Xorg                 12MiB |
|    0   N/A  N/A     25326      G   ...thor-201909061227-Linux64       58MiB |
|    0   N/A  N/A     25768      C   python                           1985MiB |
|    0   N/A  N/A     25907      C   /custom/conda/bin/python         2645MiB |
|    0   N/A  N/A     25908      C   /custom/conda/bin/python         2643MiB |
|    0   N/A  N/A     25948      G   ...thor-201909061227-Linux64       64MiB |
|    0   N/A  N/A     25949      G   ...thor-201909061227-Linux64       63MiB |
|    1   N/A  N/A     16264      G   /usr/lib/xorg/Xorg                  8MiB |
|    2   N/A  N/A     16264      G   /usr/lib/xorg/Xorg                  8MiB |
|    3   N/A  N/A     16264      G   /usr/lib/xorg/Xorg                  8MiB |
+-----------------------------------------------------------------------------+

When a job is assigned only to GPU 1 (memory is nonetheless allocated on GPU 0)

Every 2.0s: nvidia-smi -lms 100                                                                                                dualarm-server: Wed Apr 27 22:32:22 2022

Wed Apr 27 22:32:22 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 465.19.01    CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA TITAN Xp     Off  | 00000000:02:00.0 Off |                  N/A |
| 38%   65C    P2    77W / 250W |   3713MiB / 12194MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA TITAN Xp     Off  | 00000000:03:00.0 Off |                  N/A |
| 35%   61C    P2    70W / 250W |   6054MiB / 12196MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA TITAN Xp     Off  | 00000000:82:00.0 Off |                  N/A |
| 23%   33C    P8     9W / 250W |     13MiB / 12196MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA TITAN Xp     Off  | 00000000:83:00.0 Off |                  N/A |
| 23%   39C    P8    11W / 250W |     13MiB / 12196MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

When jobs are assigned to all 4 GPUs (before the CUDA OOM error)

Wed Apr 27 13:46:11 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 465.19.01    CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA TITAN Xp     Off  | 00000000:02:00.0 Off |                  N/A |
|ERR!   48C    P8   ERR! / 250W |  10245MiB / 12194MiB |     78%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA TITAN Xp     Off  | 00000000:03:00.0 Off |                  N/A |
| 35%   60C    P2   111W / 250W |   4185MiB / 12196MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA TITAN Xp     Off  | 00000000:82:00.0 Off |                  N/A |
| 30%   43C    P8    10W / 250W |   2999MiB / 12196MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA TITAN Xp     Off  | 00000000:83:00.0 Off |                  N/A |
| 37%   63C    P2    68W / 250W |    225MiB / 12196MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Log when running on all GPUs

bash run_tests_unseen.sh 
Running PID: 8992
Waiting PIDS: 8992
Auto GPU config:
Number of processes: 5
Number of processes on GPU 0: 2
Number of processes per GPU: 1
Auto GPU config:
Number of processes: 5
Number of processes on GPU 0: 2
Number of processes per GPU: 1
dn is  first_run_0
Auto GPU config:
Number of processes: 5
Number of processes on GPU 0: 2
Number of processes per GPU: 1
Auto GPU config:
Number of processes: 5
Number of processes on GPU 0: 2
Number of processes per GPU: 1
Found path: /root/.ai2thor/releases/thor-201909061227-Linux64/thor-201909061227-Linux64
Mono path[0] = '/root/.ai2thor/releases/thor-201909061227-Linux64/thor-201909061227-Linux64_Data/Managed'
Mono config path = '/root/.ai2thor/releases/thor-201909061227-Linux64/thor-201909061227-Linux64_Data/Mono/etc'
Found path: /root/.ai2thor/releases/thor-201909061227-Linux64/thor-201909061227-Linux64
Mono path[0] = '/root/.ai2thor/releases/thor-201909061227-Linux64/thor-201909061227-Linux64_Data/Managed'
Mono config path = '/root/.ai2thor/releases/thor-201909061227-Linux64/thor-201909061227-Linux64_Data/Mono/etc'
Preloaded 'ScreenSelector.so'
Display 0 '0': 1024x768 (primary device).
Display 1 '1': 1024x768 (secondary device).
Display 2 '2': 1024x768 (secondary device).
Display 3 '3': 1024x768 (secondary device).
Logging to /root/.config/unity3d/Allen Institute for Artificial Intelligence/AI2-Thor/Player.log
Preloaded 'ScreenSelector.so'
Display 0 '0': 1024x768 (primary device).
Display 1 '1': 1024x768 (secondary device).
Display 2 '2': 1024x768 (secondary device).
Display 3 '3': 1024x768 (secondary device).
Logging to /root/.config/unity3d/Allen Institute for Artificial Intelligence/AI2-Thor/Player.log
ThorEnv started.
ThorEnv started.
instruction goal is  examine a bowl by the light of a lamp
self.goal_idx2cat is  {0: 'Knife', 1: 'SinkBasin', 2: 'ArmChair', 3: 'BathtubBasin', 4: 'Bed', 5: 'Cabinet', 6: 'Cart', 7: 'CoffeeMachine', 8: 'CoffeeTable', 9: 'CounterTop', 10: 'Desk', 11: 'DiningTable', 12: 'Drawer', 13: 'Dresser', 14: 'Fridge', 15: 'GarbageCan', 16: 'Microwave', 17: 'Ottoman', 18: 'Safe', 19: 'Shelf', 20: 'SideTable', 21: 'Sofa', 22: 'StoveBurner', 23: 'TVStand', 24: 'Toilet', 29: 'Bowl', 30: 'FloorLamp', 34: 'None'}
Resetting ThorEnv
instruction goal is  examine a grey bowl in the light of a lamp
self.goal_idx2cat is  {0: 'Knife', 1: 'SinkBasin', 2: 'ArmChair', 3: 'BathtubBasin', 4: 'Bed', 5: 'Cabinet', 6: 'Cart', 7: 'CoffeeMachine', 8: 'CoffeeTable', 9: 'CounterTop', 10: 'Desk', 11: 'DiningTable', 12: 'Drawer', 13: 'Dresser', 14: 'Fridge', 15: 'GarbageCan', 16: 'Microwave', 17: 'Ottoman', 18: 'Safe', 19: 'Shelf', 20: 'SideTable', 21: 'Sofa', 22: 'StoveBurner', 23: 'TVStand', 24: 'Toilet', 25: 'Bowl', 26: 'FloorLamp', 34: 'None'}
Resetting ThorEnv
Task: Examine a grey bowl in the light of a lamp.
Running PID: 9240
Waiting PIDS: 8992 9240
Auto GPU config:
Number of processes: 5
Number of processes on GPU 0: 2
Number of processes per GPU: 1
Auto GPU config:
Number of processes: 5
Number of processes on GPU 0: 2
Number of processes per GPU: 1
dn is  first_run_1
Auto GPU config:
Number of processes: 5
Number of processes on GPU 0: 2
Number of processes per GPU: 1
Auto GPU config:
Number of processes: 5
Number of processes on GPU 0: 2
Number of processes per GPU: 1
Found path: /root/.ai2thor/releases/thor-201909061227-Linux64/thor-201909061227-Linux64
Mono path[0] = '/root/.ai2thor/releases/thor-201909061227-Linux64/thor-201909061227-Linux64_Data/Managed'
Mono config path = '/root/.ai2thor/releases/thor-201909061227-Linux64/thor-201909061227-Linux64_Data/Mono/etc'
Found path: /root/.ai2thor/releases/thor-201909061227-Linux64/thor-201909061227-Linux64
Mono path[0] = '/root/.ai2thor/releases/thor-201909061227-Linux64/thor-201909061227-Linux64_Data/Managed'
Mono config path = '/root/.ai2thor/releases/thor-201909061227-Linux64/thor-201909061227-Linux64_Data/Mono/etc'
Preloaded 'ScreenSelector.so'
Display 0 '0': 1024x768 (primary device).
Display 1 '1': 1024x768 (secondary device).
Display 2 '2': 1024x768 (secondary device).
Display 3 '3': 1024x768 (secondary device).
Preloaded 'ScreenSelector.so'
Logging to /root/.config/unity3d/Allen Institute for Artificial Intelligence/AI2-Thor/Player.log
Display 0 '0': 1024x768 (primary device).
Display 1 '1': 1024x768 (secondary device).
Display 2 '2': 1024x768 (secondary device).
Display 3 '3': 1024x768 (secondary device).
Logging to /root/.config/unity3d/Allen Institute for Artificial Intelligence/AI2-Thor/Player.log
ThorEnv started.
ThorEnv started.
Running PID: 9492
Waiting PIDS: 8992 9240 9492
instruction goal is  move two dog sculptures to the coffee table 
instruction goal is  grab the grey bowl on the corner table turn on the lamp
self.goal_idx2cat is  {0: 'Knife', 1: 'SinkBasin', 2: 'ArmChair', 3: 'BathtubBasin', 4: 'Bed', 5: 'Cabinet', 6: 'Cart', 7: 'CoffeeMachine', 8: 'CoffeeTable', 9: 'CounterTop', 10: 'Desk', 11: 'DiningTable', 12: 'Drawer', 13: 'Dresser', 14: 'Fridge', 15: 'GarbageCan', 16: 'Microwave', 17: 'Ottoman', 18: 'Safe', 19: 'Shelf', 20: 'SideTable', 21: 'Sofa', 22: 'StoveBurner', 23: 'TVStand', 24: 'Toilet', 29: 'Statue', 34: 'None'}
self.goal_idx2cat is  {0: 'Knife', 1: 'SinkBasin', 2: 'ArmChair', 3: 'BathtubBasin', 4: 'Bed', 5: 'Cabinet', 6: 'Cart', 7: 'CoffeeMachine', 8: 'CoffeeTable', 9: 'CounterTop', 10: 'Desk', 11: 'DiningTable', 12: 'Drawer', 13: 'Dresser', 14: 'Fridge', 15: 'GarbageCan', 16: 'Microwave', 17: 'Ottoman', 18: 'Safe', 19: 'Shelf', 20: 'SideTable', 21: 'Sofa', 22: 'StoveBurner', 23: 'TVStand', 24: 'Toilet', 25: 'Bowl', 26: 'FloorLamp', 34: 'None'}
Resetting ThorEnv
Resetting ThorEnv
Auto GPU config:
Number of processes: 5
Number of processes on GPU 0: 2
Number of processes per GPU: 1
Auto GPU config:
Number of processes: 5
Number of processes on GPU 0: 2
Number of processes per GPU: 1
dn is  first_run_2
Auto GPU config:
Number of processes: 5
Number of processes on GPU 0: 2
Number of processes per GPU: 1
Auto GPU config:
Number of processes: 5
Number of processes on GPU 0: 2
Number of processes per GPU: 1
Found path: /root/.ai2thor/releases/thor-201909061227-Linux64/thor-201909061227-Linux64
Mono path[0] = '/root/.ai2thor/releases/thor-201909061227-Linux64/thor-201909061227-Linux64_Data/Managed'
Mono config path = '/root/.ai2thor/releases/thor-201909061227-Linux64/thor-201909061227-Linux64_Data/Mono/etc'
Preloaded 'ScreenSelector.so'
Display 0 '0': 1024x768 (primary device).
Display 1 '1': 1024x768 (secondary device).
Display 2 '2': 1024x768 (secondary device).
Display 3 '3': 1024x768 (secondary device).
Logging to /root/.config/unity3d/Allen Institute for Artificial Intelligence/AI2-Thor/Player.log
Found path: /root/.ai2thor/releases/thor-201909061227-Linux64/thor-201909061227-Linux64
Mono path[0] = '/root/.ai2thor/releases/thor-201909061227-Linux64/thor-201909061227-Linux64_Data/Managed'
Mono config path = '/root/.ai2thor/releases/thor-201909061227-Linux64/thor-201909061227-Linux64_Data/Mono/etc'
Preloaded 'ScreenSelector.so'
Display 0 '0': 1024x768 (primary device).
Display 1 '1': 1024x768 (secondary device).
Display 2 '2': 1024x768 (secondary device).
Display 3 '3': 1024x768 (secondary device).
Logging to /root/.config/unity3d/Allen Institute for Artificial Intelligence/AI2-Thor/Player.log
ThorEnv started.
ThorEnv started.
Task: Move two dog sculptures to the coffee table. 
Running PID: 9741
Waiting PIDS: 8992 9240 9492 9741
Auto GPU config:
Number of processes: 5
Number of processes on GPU 0: 2
Number of processes per GPU: 1
Auto GPU config:
Number of processes: 5
Number of processes on GPU 0: 2
Number of processes per GPU: 1
dn is  first_run_3
instruction goal is  to move two statues to the living room table
self.goal_idx2cat is  {0: 'Knife', 1: 'SinkBasin', 2: 'ArmChair', 3: 'BathtubBasin', 4: 'Bed', 5: 'Cabinet', 6: 'Cart', 7: 'CoffeeMachine', 8: 'CoffeeTable', 9: 'CounterTop', 10: 'Desk', 11: 'DiningTable', 12: 'Drawer', 13: 'Dresser', 14: 'Fridge', 15: 'GarbageCan', 16: 'Microwave', 17: 'Ottoman', 18: 'Safe', 19: 'Shelf', 20: 'SideTable', 21: 'Sofa', 22: 'StoveBurner', 23: 'TVStand', 24: 'Toilet', 29: 'Statue', 34: 'None'}
Resetting ThorEnv
instruction goal is  put 2 dog decorations front to back on the edge of the right side of the table 
self.goal_idx2cat is  {0: 'Knife', 1: 'SinkBasin', 2: 'ArmChair', 3: 'BathtubBasin', 4: 'Bed', 5: 'Cabinet', 6: 'Cart', 7: 'CoffeeMachine', 8: 'CoffeeTable', 9: 'CounterTop', 10: 'Desk', 11: 'DiningTable', 12: 'Drawer', 13: 'Dresser', 14: 'Fridge', 15: 'GarbageCan', 16: 'Microwave', 17: 'Ottoman', 18: 'Safe', 19: 'Shelf', 20: 'SideTable', 21: 'Sofa', 22: 'StoveBurner', 23: 'TVStand', 24: 'Toilet', 25: 'Statue', 34: 'None'}
Resetting ThorEnv
Auto GPU config:
Number of processes: 5
Number of processes on GPU 0: 2
Number of processes per GPU: 1
Auto GPU config:
Number of processes: 5
Number of processes on GPU 0: 2
Number of processes per GPU: 1
Found path: /root/.ai2thor/releases/thor-201909061227-Linux64/thor-201909061227-Linux64
Mono path[0] = '/root/.ai2thor/releases/thor-201909061227-Linux64/thor-201909061227-Linux64_Data/Managed'
Mono config path = '/root/.ai2thor/releases/thor-201909061227-Linux64/thor-201909061227-Linux64_Data/Mono/etc'
Preloaded 'ScreenSelector.so'
Display 0 '0': 1024x768 (primary device).
Display 1 '1': 1024x768 (secondary device).
Display 2 '2': 1024x768 (secondary device).
Display 3 '3': 1024x768 (secondary device).
Logging to /root/.config/unity3d/Allen Institute for Artificial Intelligence/AI2-Thor/Player.log
Found path: /root/.ai2thor/releases/thor-201909061227-Linux64/thor-201909061227-Linux64
Mono path[0] = '/root/.ai2thor/releases/thor-201909061227-Linux64/thor-201909061227-Linux64_Data/Managed'
Mono config path = '/root/.ai2thor/releases/thor-201909061227-Linux64/thor-201909061227-Linux64_Data/Mono/etc'
Preloaded 'ScreenSelector.so'
Display 0 '0': 1024x768 (primary device).
Display 1 '1': 1024x768 (secondary device).
Display 2 '2': 1024x768 (secondary device).
Display 3 '3': 1024x768 (secondary device).
Logging to /root/.config/unity3d/Allen Institute for Artificial Intelligence/AI2-Thor/Player.log
Task: Examine a bowl by the light of a lamp.
ThorEnv started.
ThorEnv started.
Task: Put 2 dog decorations front to back on the edge of the right side of the table. 
Process ForkServerProcess-1:
Traceback (most recent call last):
  File "/custom/conda/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/custom/conda/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/hongin/FILM/envs/utils/vector_env.py", line 179, in _worker_env
    env = env_fn(*env_fn_args)
  File "/home/hongin/FILM/envs/__init__.py", line 122, in make_env_fn_alfred
    env = Sem_Exp_Env_Agent_Thor(args, scene_names, rank)
  File "/home/hongin/FILM/agents/sem_exp_thor.py", line 77, in __init__
    self.seg = SemgnetationHelper(self)
  File "/home/hongin/FILM/models/segmentation/segmentation_helper.py", line 27, in __init__
    self.sem_seg_model_alfw_large = load_pretrained_model('models/segmentation/maskrcnn_alfworld/receps_lr5e-3_003.pth', torch.device("cuda:0" if args.cuda else "cpu"), 'recep')
  File "/home/hongin/FILM/models/segmentation/alfworld_mrcnn.py", line 90, in load_pretrained_model
    mask_rcnn.load_state_dict(torch.load(path, map_location=device))
  File "/custom/conda/lib/python3.9/site-packages/torch/serialization.py", line 592, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "/custom/conda/lib/python3.9/site-packages/torch/serialization.py", line 851, in _load
    result = unpickler.load()
  File "/custom/conda/lib/python3.9/site-packages/torch/serialization.py", line 843, in persistent_load
    load_tensor(data_type, size, key, _maybe_decode_ascii(location))
  File "/custom/conda/lib/python3.9/site-packages/torch/serialization.py", line 832, in load_tensor
    loaded_storages[key] = restore_location(storage, location)
  File "/custom/conda/lib/python3.9/site-packages/torch/serialization.py", line 812, in restore_location
    return default_restore_location(storage, str(map_location))
  File "/custom/conda/lib/python3.9/site-packages/torch/serialization.py", line 175, in default_restore_location
    result = fn(storage, location)
  File "/custom/conda/lib/python3.9/site-packages/torch/serialization.py", line 157, in _cuda_deserialize
    return obj.cuda(device)
  File "/custom/conda/lib/python3.9/site-packages/torch/_utils.py", line 80, in _cuda
    return new_type(self.size()).copy_(self, non_blocking)
  File "/custom/conda/lib/python3.9/site-packages/torch/cuda/__init__.py", line 484, in _lazy_new
    return super(_CudaBase, cls).__new__(cls, *args, **kwargs)
RuntimeError: CUDA error: out of memory
Process ForkServerProcess-2:
Traceback (most recent call last):
  File "/custom/conda/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/custom/conda/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/hongin/FILM/envs/utils/vector_env.py", line 179, in _worker_env
    env = env_fn(*env_fn_args)
  File "/home/hongin/FILM/envs/__init__.py", line 122, in make_env_fn_alfred
    env = Sem_Exp_Env_Agent_Thor(args, scene_names, rank)
  File "/home/hongin/FILM/agents/sem_exp_thor.py", line 77, in __init__
    self.seg = SemgnetationHelper(self)
  File "/home/hongin/FILM/models/segmentation/segmentation_helper.py", line 27, in __init__
    self.sem_seg_model_alfw_large = load_pretrained_model('models/segmentation/maskrcnn_alfworld/receps_lr5e-3_003.pth', torch.device("cuda:0" if args.cuda else "cpu"), 'recep')
  File "/home/hongin/FILM/models/segmentation/alfworld_mrcnn.py", line 90, in load_pretrained_model
    mask_rcnn.load_state_dict(torch.load(path, map_location=device))
  File "/custom/conda/lib/python3.9/site-packages/torch/serialization.py", line 592, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "/custom/conda/lib/python3.9/site-packages/torch/serialization.py", line 851, in _load
    result = unpickler.load()
  File "/custom/conda/lib/python3.9/site-packages/torch/serialization.py", line 843, in persistent_load
    load_tensor(data_type, size, key, _maybe_decode_ascii(location))
  File "/custom/conda/lib/python3.9/site-packages/torch/serialization.py", line 832, in load_tensor
    loaded_storages[key] = restore_location(storage, location)
  File "/custom/conda/lib/python3.9/site-packages/torch/serialization.py", line 812, in restore_location
    return default_restore_location(storage, str(map_location))
  File "/custom/conda/lib/python3.9/site-packages/torch/serialization.py", line 175, in default_restore_location
    result = fn(storage, location)
  File "/custom/conda/lib/python3.9/site-packages/torch/serialization.py", line 157, in _cuda_deserialize
    return obj.cuda(device)
  File "/custom/conda/lib/python3.9/site-packages/torch/_utils.py", line 80, in _cuda
    return new_type(self.size()).copy_(self, non_blocking)
  File "/custom/conda/lib/python3.9/site-packages/torch/cuda/__init__.py", line 484, in _lazy_new
    return super(_CudaBase, cls).__new__(cls, *args, **kwargs)
RuntimeError: CUDA error: out of memory
Traceback (most recent call last):
  File "/home/hongin/FILM/main.py", line 831, in <module>
    main()
  File "/home/hongin/FILM/main.py", line 116, in main
    envs = make_vec_envs(args)
  File "/home/hongin/FILM/envs/__init__.py", line 15, in make_vec_envs
    envs = construct_envs_alfred(args)
  File "/home/hongin/FILM/envs/__init__.py", line 137, in construct_envs_alfred
    envs = VectorEnv(make_env_fn=make_env_fn_alfred,
  File "/home/hongin/FILM/envs/utils/vector_env.py", line 149, in __init__
    self.observation_spaces = [
  File "/home/hongin/FILM/envs/utils/vector_env.py", line 150, in <listcomp>
    read_fn() for read_fn in self._connection_read_fns
  File "/custom/conda/lib/python3.9/multiprocessing/connection.py", line 255, in recv
    buf = self._recv_bytes()
  File "/custom/conda/lib/python3.9/multiprocessing/connection.py", line 419, in _recv_bytes
    buf = self._recv(4)
  File "/custom/conda/lib/python3.9/multiprocessing/connection.py", line 384, in _recv
    chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer
Exception ignored in: <function VectorEnv.__del__ at 0x7f67ef0f1ee0>
Traceback (most recent call last):
  File "/home/hongin/FILM/envs/utils/vector_env.py", line 767, in __del__
    self.close()
  File "/home/hongin/FILM/envs/utils/vector_env.py", line 567, in close
    write_fn((CLOSE_COMMAND, None))
  File "/custom/conda/lib/python3.9/multiprocessing/connection.py", line 211, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/custom/conda/lib/python3.9/multiprocessing/connection.py", line 416, in _send_bytes
    self._send(header + buf)
  File "/custom/conda/lib/python3.9/multiprocessing/connection.py", line 373, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
Process ForkServerProcess-1:
Traceback (most recent call last):
  File "/home/hongin/FILM/agents/sem_exp_thor.py", line 168, in load_initial_scene
    obs, info = self.setup_scene(traj_data, task_type, r_idx, self.args)
  File "/home/hongin/FILM/agents/sem_exp_thor.py", line 335, in setup_scene
    obs, seg_print = self._preprocess_obs(obs)
  File "/home/hongin/FILM/agents/sem_exp_thor.py", line 1358, in _preprocess_obs
    sem_seg_pred = self.seg.get_sem_pred(rgb.astype(np.uint8)) #(300, 300, num_cat)
  File "/home/hongin/FILM/models/segmentation/segmentation_helper.py", line 241, in get_sem_pred
    self.get_instance_mask_seg_alfworld_both()
  File "/home/hongin/FILM/models/segmentation/segmentation_helper.py", line 114, in get_instance_mask_seg_alfworld_both
    results_large = self.sem_seg_model_alfw_large(im_tensors)[0]
  File "/custom/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/custom/conda/lib/python3.9/site-packages/torchvision/models/detection/generalized_rcnn.py", line 94, in forward
    features = self.backbone(images.tensors)
  File "/custom/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/custom/conda/lib/python3.9/site-packages/torchvision/models/detection/backbone_utils.py", line 44, in forward
    x = self.body(x)
  File "/custom/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/custom/conda/lib/python3.9/site-packages/torchvision/models/_utils.py", line 63, in forward
    x = module(x)
  File "/custom/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/custom/conda/lib/python3.9/site-packages/torch/nn/modules/container.py", line 119, in forward
    input = module(input)
  File "/custom/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/custom/conda/lib/python3.9/site-packages/torchvision/models/resnet.py", line 133, in forward
    out = self.bn3(out)
  File "/custom/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/custom/conda/lib/python3.9/site-packages/torchvision/ops/misc.py", line 96, in forward
    return x * scale + bias
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 11.91 GiB total capacity; 486.21 MiB already allocated; 9.12 MiB free; 492.00 MiB reserved in total by PyTorch)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/custom/conda/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/custom/conda/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/hongin/FILM/envs/utils/vector_env.py", line 222, in _worker_env
    obs, info, actions_dict = env.load_initial_scene()
  File "/home/hongin/FILM/agents/sem_exp_thor.py", line 187, in load_initial_scene
    obs = np.zeros(self.obs.shape)
AttributeError: 'NoneType' object has no attribute 'shape'
Traceback (most recent call last):
  File "/home/hongin/FILM/main.py", line 831, in <module>
    main()
  File "/home/hongin/FILM/main.py", line 121, in main
    obs, infos, actions_dicts = envs.load_initial_scene()
  File "/home/hongin/FILM/envs/__init__.py", line 71, in load_initial_scene
    obs, info, actions_dict = self.venv.load_initial_scene()
  File "/home/hongin/FILM/envs/utils/vector_env.py", line 475, in load_initial_scene
    results.append(read_fn())
  File "/custom/conda/lib/python3.9/multiprocessing/connection.py", line 255, in recv
    buf = self._recv_bytes()
  File "/custom/conda/lib/python3.9/multiprocessing/connection.py", line 419, in _recv_bytes
    buf = self._recv(4)
  File "/custom/conda/lib/python3.9/multiprocessing/connection.py", line 388, in _recv
    raise EOFError
EOFError
Exception ignored in: <function VectorEnv.__del__ at 0x7f18a0f87ee0>
Traceback (most recent call last):
  File "/home/hongin/FILM/envs/utils/vector_env.py", line 767, in __del__
    self.close()
  File "/home/hongin/FILM/envs/utils/vector_env.py", line 564, in close
    read_fn()
  File "/custom/conda/lib/python3.9/multiprocessing/connection.py", line 255, in recv
    buf = self._recv_bytes()
  File "/custom/conda/lib/python3.9/multiprocessing/connection.py", line 419, in _recv_bytes
    buf = self._recv(4)
  File "/custom/conda/lib/python3.9/multiprocessing/connection.py", line 388, in _recv
    raise EOFError
EOFError: 
theorist17 (Author) commented Apr 27, 2022

Plus, when using multiple Xorg processes (several startx.py instances on different displays), I find that the semantic pictures are not visualized properly ($ALFRED_ROOT/pictures/tests_unseen/first_run_0/Sem/Sem_*.png):
[screenshot: Sem_2]

However, when using a single Xorg process (only one startx.py), the semantic pictures are visualized properly:
[screenshot: Sem_2 (single)]
