[CI] Repeat start_instance (#361)
Recently, the regression workflow has frequently failed because start_instance fails with "Insufficient capacity".

Retry starting each instance up to 300 times, sleeping 60 seconds between attempts.

Tested here:
https://github.com/CentML/hidet/actions/runs/10000711025/job/27664169588
vadiklyutiy committed Jul 22, 2024
1 parent b9551c4 commit f5f528f
Showing 2 changed files with 14 additions and 9 deletions.
21 changes: 13 additions & 8 deletions .github/scripts/start_instances.py
```diff
@@ -97,14 +97,19 @@ def run_command(cmd):
 
 # Start all instances
 for instance in instances:
-    cloud_provider_id, instance_id, _ = instance
-    if cloud_provider_id == 1: # AWS
-        cmd = ['aws', 'ec2', 'start-instances', '--instance-ids', instance_id]
-    elif cloud_provider_id == 2: # Always on, no need to launch. Do Nothing.
-        cmd = ['true']
-    else:
-        raise ValueError(f'Unknown cloud provider id: {cloud_provider_id}')
-    output = run_command(cmd)
+    for i in range(300):
+        cloud_provider_id, instance_id, _ = instance
+        if cloud_provider_id == 1: # AWS
+            cmd = ['aws', 'ec2', 'start-instances', '--instance-ids', instance_id]
+        elif cloud_provider_id == 2: # Always on, no need to launch. Do Nothing.
+            cmd = ['true']
+        else:
+            raise ValueError(f'Unknown cloud provider id: {cloud_provider_id}')
+        output = run_command(cmd)
+        if output.returncode == 0:
+            break
+        time.sleep(60)
+
     if output.returncode != 0:
         raise RuntimeError(f'Failed to start instance {instance_id} on cloud provider {cloud_provider_id}.')
 
```
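The committed loop retries on any non-zero exit code, rebuilds `cmd` on every attempt, and also sleeps once after the final failed attempt. The same fixed-delay retry can be factored into a standalone helper. This is a minimal sketch, not the project's code: `start_with_retry`, `is_capacity_error`, and the `run` callable are hypothetical names, and `run` is assumed to behave like the script's `run_command` and return a `subprocess.CompletedProcess` with `stderr` captured as text. The optional `should_retry` predicate is a design choice the commit does not make: it keys on `InsufficientInstanceCapacity`, the EC2 error code behind the "Insufficient capacity" message, so that unrelated errors fail immediately instead of using up the retry budget.

```python
import subprocess
import time
from typing import Callable, Sequence

# Hypothetical signature: a runner takes an argv list and returns a CompletedProcess,
# assumed to behave like the script's run_command(cmd).
Runner = Callable[[Sequence[str]], subprocess.CompletedProcess]


def is_capacity_error(result: subprocess.CompletedProcess) -> bool:
    """Return True if stderr mentions EC2's InsufficientInstanceCapacity error code.

    Assumes stderr was captured as text; a missing (None) stderr counts as no match.
    """
    return "InsufficientInstanceCapacity" in (result.stderr or "")


def start_with_retry(
    cmd: Sequence[str],
    run: Runner,
    attempts: int = 300,
    delay_s: float = 60.0,
    should_retry: Callable[[subprocess.CompletedProcess], bool] = lambda result: True,
    sleep: Callable[[float], None] = time.sleep,
) -> subprocess.CompletedProcess:
    """Run cmd up to `attempts` times with a fixed delay, stopping at the first exit code 0.

    Returns the last result; the caller still decides whether a non-zero code is fatal.
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    result = run(cmd)
    # At most attempts - 1 retries, so there is no sleep after the final attempt.
    for _ in range(attempts - 1):
        if result.returncode == 0 or not should_retry(result):
            break  # success, or an error that waiting will not fix
        sleep(delay_s)
        result = run(cmd)
    return result
```

In the script, the per-instance loop could build `cmd` once and call `start_with_retry(cmd, run_command, should_retry=is_capacity_error)`, keeping the existing `returncode != 0` check afterwards. Passing `sleep` as a parameter lets tests substitute a recorder instead of waiting 60 real seconds.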
2 changes: 1 addition & 1 deletion .github/workflows/regression.yaml
```diff
@@ -34,7 +34,7 @@ jobs:
 
       - name: Run main Python script
         id: run_py_script
-        run: timeout 900 python ./.github/scripts/start_instances.py
+        run: timeout 36000 python ./.github/scripts/start_instances.py
         env:
           # TODO: Allow launching only specified GPU instances
           HW_CONFIG: all
```
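The larger timeout can be checked against the retry budget. A rough worst case per instance, counting only the sleeps (the committed loop sleeps after each of its 300 failed attempts, including the last) and ignoring the time each `aws` call itself takes:

$$
300 \times 60\,\mathrm{s} = 18\,000\,\mathrm{s} = 5\,\mathrm{h},
\qquad
\frac{36\,000\,\mathrm{s}}{18\,000\,\mathrm{s}} = 2
$$

Since instances are started one after another, the 10-hour `timeout 36000` covers roughly two instances that each exhaust their retries; if more instances stayed at capacity, `timeout` would stop the script before it reached its own `RuntimeError`.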
