Investigate the performance and time taken by the mass_deployments workflow #1352

Closed
Tracked by #1489
mohamedamer453 opened this issue Nov 2, 2023 · 15 comments

Comments

@mohamedamer453
Contributor

Description

Investigate the performance of the mass_deployments workflow and how long it takes to run.

The latest workflow run took almost 3 hours for 100 deployments, mainly because some of the failed deployments took ~10 minutes to time out, as seen in https://github.com/threefoldtech/tfgrid-sdk-ts/actions/runs/6727307503/job/18284925859#step:6:3325

@mohamedamer453 mohamedamer453 added type_bug Something isn't working grid_client labels Nov 2, 2023
@xmonader xmonader added this to 3.13.x Nov 3, 2023
@xmonader xmonader moved this to Accepted in 3.13.x Nov 3, 2023
@xmonader xmonader added this to the 2.3.0 milestone Nov 3, 2023
@AlaaElattar AlaaElattar self-assigned this Nov 9, 2023
@AlaaElattar AlaaElattar moved this from Accepted to In Progress in 3.13.x Nov 9, 2023
@AlaaElattar
Contributor

Work in Progress (WIP):

  • I started by running only 10 deployments and tracking how long each one takes.
  • On dev, all of them failed with the same error: Error getting node x, didn't respond after 10 seconds.
  • I tried running on qa but got the same error for all 10 deployments.
  • All of this was with randomize removed from both nodes and farms.

@AlaaElattar
Contributor

Work in Progress (WIP):

  • I've increased the timeout to 20 seconds.
  • All 10 deployments now fail with Error: Error getting node 2: Error: Failed to send request to twinId 9 with command: zos.statistics.get, payload: due to 102 code: 102, but without the timeout part.
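
As a rough illustration of the per-request timeout discussed above, here is a minimal TypeScript sketch. The name `withTimeout` and its parameters are hypothetical and are not taken from grid_client; the sketch only shows the general pattern of rejecting a request that does not settle within a configurable number of milliseconds.

```typescript
// Hypothetical helper (not part of grid_client): settles with the result of
// `work`, or rejects if `work` has not settled within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => {
      reject(new Error(`${label} didn't respond after ${ms / 1000} seconds`));
    }, ms);

    work.then(
      value => {
        clearTimeout(timer); // the request finished in time, cancel the timer
        resolve(value);
      },
      err => {
        clearTimeout(timer); // the request failed on its own, pass the error on
        reject(err);
      },
    );
  });
}
```

With a helper like this, raising the limit from 10 to 20 seconds would be a one-argument change at the call site, for example `withTimeout(request, 20_000, "node 2")`, where `request` stands for whatever promise the node call returns.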

@AlaaElattar AlaaElattar moved this from In Progress to Blocked in 3.13.x Nov 12, 2023
@xmonader
Contributor

Issue blocked on dysfunctional devnet

@AlaaElattar AlaaElattar moved this from Blocked to In Progress in 3.13.x Nov 13, 2023
@AlaaElattar
Contributor

AlaaElattar commented Nov 14, 2023

Investigation:

  • A deployment that is created successfully takes 18 to 30 seconds on average (about 1 minute if the network is slow).
  • A deployment that fails because the request could not be sent (code 102) takes 10 seconds, which is the request timeout.
  • A deployment that fails with "Node 128 is not available for user with twinId: 214, maybe it's rented by another user or node is dedicated. use capacity planning with availableFor option." takes 3 to 4 seconds on average. (This is an issue that needs to be handled either here or in gridproxy.)

@AlaaElattar
Contributor

Work Completed:

  • The script now deploys VMs in batches.
  • For example, 1000 VMs are deployed in batches of 100. If one deployment in a batch fails, the whole batch fails.
  • This reduces the total run time a lot.
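
The batching described above could look roughly like the following TypeScript sketch. `toBatches`, `deployInBatches`, and the `deployBatch` callback are hypothetical names used for illustration; the actual script in the repository may be structured differently. The sketch assumes that `deployBatch` rejects when any item in the batch fails, so a single failure marks the whole batch as failed.

```typescript
// Hypothetical helper: split a list into consecutive batches of `batchSize`.
function toBatches<T>(items: T[], batchSize: number): T[][] {
  if (!Number.isInteger(batchSize) || batchSize <= 0) {
    throw new Error("batchSize must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// Hypothetical result record for one batch.
interface BatchResult {
  batchIndex: number;
  ok: boolean;
  error?: string;
}

// Deploys the batches one after another. `deployBatch` is an assumed callback
// that submits one batch and rejects if any deployment in it fails.
async function deployInBatches<T>(
  items: T[],
  batchSize: number,
  deployBatch: (batch: T[]) => Promise<void>,
): Promise<BatchResult[]> {
  const results: BatchResult[] = [];
  const batches = toBatches(items, batchSize);
  for (let i = 0; i < batches.length; i++) {
    try {
      await deployBatch(batches[i]);
      results.push({ batchIndex: i, ok: true });
    } catch (err) {
      // One failure fails the whole batch; later batches still run.
      results.push({ batchIndex: i, ok: false, error: String(err) });
    }
  }
  return results;
}
```

Recording a result per batch, instead of stopping at the first error, keeps one bad batch from hiding the outcome of the remaining ones.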

Investigation:

  • GitHub Actions has a matrix strategy that runs several instances of the same job at the same time.
  • I'm investigating how to use it.

@AlaaElattar AlaaElattar moved this from In Progress to Blocked in 3.13.x Nov 15, 2023
@AlaaElattar
Contributor

  • I'm now blocked because the account that runs the script on mainnet has 0 TFTs. I need it funded so I can measure the workflow's run time rather than only testing locally.

@AlaaElattar AlaaElattar moved this from Blocked to In Progress in 3.13.x Dec 3, 2023
@AlaaElattar
Contributor

Work In Progress (WIP)

  • After running the script on mainnet, I found that a successful VM deployment takes 1 minute on average, which is still a lot.
  • Two errors occur frequently: either the node didn't respond after 10 seconds, or wg_ports couldn't be retrieved.
  • I'm investigating these errors.

@AlaaElattar
Contributor

Work In Progress (WIP)

  • Offline nodes are now excluded from filterNodes.
  • I also found that a batch takes less than 1 minute locally.
  • Still investigating the errors that prevent a batch from being deployed.

@AlaaElattar
Contributor

Work Completed:

  • Nodes are now pinged in parallel using Promise.all.
  • We also found that filtering nodes takes a long time (40 seconds), so we now filter once and then pick nodes at random from the resulting list.
  • A batch of 50 deployments takes about 10 minutes.
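
A minimal TypeScript sketch of the two ideas above, pinging all candidate nodes in parallel and then picking at random from a list that is filtered only once, might look like this. `NodeInfo`, `filterOnlineNodes`, `pickRandom`, and the `pingNode` callback are hypothetical names and are not the grid_client API; `pingNode` is assumed to resolve to `true` when the node answers and to either resolve to `false` or reject otherwise.

```typescript
// Hypothetical minimal node shape for this sketch.
interface NodeInfo {
  nodeId: number;
}

// Ping every node concurrently and keep only the ones that answered.
// A rejected ping is treated as "offline" so one bad node cannot
// reject the whole Promise.all.
async function filterOnlineNodes(
  nodes: NodeInfo[],
  pingNode: (node: NodeInfo) => Promise<boolean>,
): Promise<NodeInfo[]> {
  const statuses = await Promise.all(
    nodes.map(node => pingNode(node).catch(() => false)),
  );
  return nodes.filter((_, i) => statuses[i]);
}

// Pick one element at random from a pre-filtered list.
// `random` is injectable so the choice can be made deterministic in tests.
function pickRandom<T>(items: T[], random: () => number = Math.random): T {
  if (items.length === 0) {
    throw new Error("no candidates to choose from");
  }
  return items[Math.floor(random() * items.length)];
}
```

The intended usage is to call `filterOnlineNodes` once per run, keep the returned list, and call `pickRandom` on it for every deployment, so the expensive filtering step is not repeated.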

Work in progress (WIP)

  • Working on using a matrix to run 4 instances of mass_deployment, each deploying 250 VMs.
  • Each instance will use a different account.

@AlaaElattar
Contributor

AlaaElattar commented Dec 6, 2023

Work Completed:

  • Creating deployments now also runs in parallel.
  • Instead of using a GitHub matrix, we decided to have the mass deployment run 250 VMs and schedule it every 6 hours.
  • 50 VMs take 7 to 8 minutes.

@zaelgohary zaelgohary moved this from In Progress to Pending review in 3.13.x Dec 7, 2023
@AlaaElattar
Contributor

AlaaElattar commented Dec 7, 2023

Work In Progress (WIP):

  • Working on PR comments.

@AlaaElattar
Contributor

Work Completed:

  • All PR comments applied.

@zaelgohary zaelgohary moved this from Pending review to In Verification in 3.13.x Dec 10, 2023
@A-Harby
Contributor

A-Harby commented Dec 14, 2023

Verified.

The mass deployment is now running in much less time than before, taking only about 20 minutes to run the full batch.

But the workflow sometimes passes and sometimes fails. I created a new issue for it here: #1699.

@A-Harby A-Harby moved this from In Verification to Done in 3.13.x Dec 14, 2023