[Bug 🐞]: Cannot deploy VM in Australia, in Europe i can #419

Closed
sony87 opened this issue Feb 25, 2024 · 7 comments

@sony87

sony87 commented Feb 25, 2024

What happened?

I'm in Europe and can deploy VMs on European farms without issue. If I try to deploy to any Australian farm/node, it fails after 10 minutes, every time.

What did you expect?

To be able to deploy everywhere, regardless of my location.

What browsers are you seeing the problem on?

No response

ZOS info

No response

Dashboard info

No response

weblets info

No response

Relevant log output

No response

@xmonader

Can you please add the farm ID or the node ID that you tried to deploy on?

@xmonader

Check 4985 and 2594; couldn't deploy on either.

It kept showing "Waiting for deployment with contract_id: 236293 to be ready" and "Waiting for deployment with contract_id: 236290 to be ready".

@sony87
Author

sony87 commented Feb 25, 2024

Nodes: 4349, 4350.
On the farm "Mango Farm", most of the nodes do not work: 2595, 2596, 2636, etc.

@khaledyoussef24 khaledyoussef24 added the type_bug Something isn't working label Feb 26, 2024
@PeterNashaat PeterNashaat moved this to In Progress in 3.13.x Feb 27, 2024
@sabrinasadik

The problem might be caused by latency to the hub. That could cause the deployment to time out while it is still fetching data from the hub (probably while copying a disk image from 0-fs to the local disk). If this is indeed the problem, it can be verified as follows:

  • Check metrics.grid.tf: you should see network usage that starts at the time of the deployment and lasts for more than 10 minutes, which indicates a slow download.
  • To verify it yourself: start a VM with a disk image that is not yet on the node you're deploying on.
  • Check the metrics; you will see relatively consistent network usage.
  • After a while (about 10 minutes) the deployment will time out.
  • Network usage will stay the same.
  • After some more time, the network usage will drop again (this means the disk image finished downloading).
  • If you now deploy the same disk image again, it should work.

I'm assuming the disk copy keeps running after the deployment timeout. If that is not the case, you may have to redeploy a couple of times until the disk image is completely in the 0-fs cache.
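
The redeploy-until-cached workaround could be scripted roughly as below. This is only a sketch: `deploy` is a hypothetical callable standing in for whatever client actually submits the deployment (it is not a real grid API), and the 600-second default wait simply mirrors the roughly 10-minute timeout described above.

```python
import time
from typing import Callable


def deploy_until_cached(
    deploy: Callable[[], bool],
    max_attempts: int = 5,
    wait_seconds: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call deploy() until it succeeds and return the number of attempts used.

    deploy is a hypothetical callable that returns True once the VM is up
    and False when the deployment timed out.
    """
    for attempt in range(1, max_attempts + 1):
        if deploy():
            return attempt
        if attempt < max_attempts:
            # Give the image download time to progress before retrying.
            sleep(wait_seconds)
    raise TimeoutError(f"deployment still failing after {max_attempts} attempts")
```

Passing `sleep` as a parameter keeps the loop easy to try out without actually waiting; in real use the default `time.sleep` applies.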

If this is indeed the case, then either zos needs a workaround, or the actual solution is to make the hub available in multiple geographic regions so that latency is consistently low (a distributed hub or some kind of CDN).
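
To illustrate the multi-region idea, here is a minimal sketch of choosing the lowest-latency endpoint from a list of candidates. It assumes regional hub endpoints exist, which this thread does not confirm; the function names are made up for the example, and TCP connect time is used only as a rough stand-in for latency.

```python
import socket
import time


def tcp_connect_ms(host, port=443, timeout=5.0):
    """Return the time in milliseconds needed to open a TCP connection."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return (time.perf_counter() - start) * 1000.0


def pick_lowest_latency(hosts, probe=tcp_connect_ms):
    """Return the host with the smallest probe time, skipping unreachable ones."""
    timings = {}
    for host in hosts:
        try:
            timings[host] = probe(host)
        except OSError:
            # Unreachable or timed-out candidates are ignored.
            continue
    if not timings:
        raise RuntimeError("no candidate host was reachable")
    return min(timings, key=timings.get)
```

For the result to mean anything, the probe would have to run from the node's region (for example from a VM already deployed there), not from the user's own machine.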

@PeterNashaat
Member

Deploying a VM on TheBatcave (farm ID 2252, node ID 4985):
  • A VM with the Ubuntu 22.04 flist, which was already downloaded on the node, works fine.
    • ZOS logs:
 [+] flistd: 2024-02-27T09:25:36Z info flist already in on the filesystem url=https://hub.grid.tf/tf-official-vms/ubuntu-22.04.flist
  • Deploying with the NixOS flist, which had not been used before on that node:
    • ZOS logs:
2024-02-27 14:34:00 | [+] flistd: 2024-02-27T13:34:00Z info request to mount flist: {ReadOnly:true Limit:0 Storage: PersistedVolume:} name=cloud-container:c65ef166512f3d5fe7c61fc3d8dd3c89 storage= url=https://hub.grid.tf/tf-autobuilder/cloud-container-8730b6f.flist
2024-02-27 14:33:57 | [+] identityd: 2024-02-27T13:33:57Z info checking for update after milliseconds wait=4440000
2024-02-27 14:33:57 | [+] identityd: 2024-02-27T13:33:57Z info checking if update is required current=3.9.0 latest=3.9.0
2024-02-27 14:33:56 | [+] flistd: 2024-02-27T13:33:56Z info starting g8ufs daemon args=["--cache","/var/cache/modules/flistd/cache","--meta","/var/cache/modules/flistd/flist/fa05b43ad1c5362453cb70de7cea9664","--daemon","--log","/var/cache/modules/flistd/log/fa05b43ad1c5362453cb70de7cea9664.log"] storage= url=https://hub.grid.tf/tf-official-vms/nixos-22.11.flist
2024-02-27 14:33:54 | [+] flistd: 2024-02-27T13:33:54Z info request to mount flist storage= url=https://hub.grid.tf/tf-official-vms/nixos-22.11.flist
2024-02-27 14:33:54 | [+] flistd: 2024-02-27T13:33:54Z info request to mount flist: {ReadOnly:true Limit:0 Storage: PersistedVolume:} name=604-240316-thebatcavetest2 storage= url=https://hub.grid.tf/tf-official-vms/nixos-22.11.flist

[screenshot]
  • Node network traffic was at its peak and climbing each minute, as these two screenshots show:
[screenshot] [screenshot]
  • In the Dashboard, it first waited for the VM to be ready:
Waiting for deployment with contract_id: 240316 to be ready
  • Then it returned this error:
Failed to send request to twinId 7688 with command: zos.deployment.get, payload: {"contract_id":240316} Didn't get a response after 20 seconds
  • Then the contracts got cancelled.
    • ZOS logs:
[screenshot]
    • Network traffic still climbing:
[screenshot]
  • Tried deploying NixOS again after the network traffic decreased.
    • ZOS logs:
[+] flistd: 2024-02-27T14:05:57Z info flist already in on the filesystem url=https://hub.grid.tf/tf-official-vms/nixos-22.11.flist

  • Network traffic:
[screenshot]
  • The VM was deployed successfully.
[screenshot]
  • Did a quick speed test on the VM:
root@thebatcavetest:~# speedtest-cli
Retrieving speedtest.net configuration...
Testing from Aussie Broadband (159.196.171.188)...
Retrieving speedtest.net server list...
Selecting best server based on ping...
Hosted by Superloop Australia Pty Ltd (Sydney) [0.09 km]: 16.961 ms
Testing download speed................................................................................
Download: 269.32 Mbit/s
Testing upload speed......................................................................................................
Upload: 23.58 Mbit/s

@sabrinasadik Confirmed: the flist download from the hub takes a long time, which causes a timeout on the dashboard side and then cancels the contracts. The flist download continues, though, and deploying again works once the download is done.
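
As a rough aid for checking this from the logs, the sketch below sorts flist URLs by whether flistd reported them as already on the filesystem. It is a heuristic based only on the two message fragments quoted in this thread; the function name and the "cached" / "mount requested" labels are made up for the example, and other flistd messages are not handled.

```python
import re

# Message fragments copied from the flistd log lines quoted in this thread.
CACHED_MARKER = "flist already in on the filesystem"
MOUNT_MARKER = "request to mount flist"
URL_PATTERN = re.compile(r"url=(\S+)")


def flist_cache_status(log_lines):
    """Map each flist URL to "cached" or "mount requested"."""
    status = {}
    for line in log_lines:
        match = URL_PATTERN.search(line)
        if not match:
            continue
        url = match.group(1)
        if CACHED_MARKER in line:
            status[url] = "cached"
        elif MOUNT_MARKER in line:
            # Never downgrade a URL that was already seen as cached.
            status.setdefault(url, "mount requested")
    return status
```

A URL that only ever shows up as "mount requested" is a candidate for the slow-download case described above; once a "flist already in on the filesystem" line appears for it, a redeploy should go through.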

@sony87
Author

sony87 commented Feb 27, 2024

So what you are saying is that I need to keep redeploying on the same machine until it completes?

@sabrinasadik

Until we have a workaround or fix the issue, yes. @xmonader let's discuss further to have a solution for this.

@PeterNashaat PeterNashaat moved this from In Progress to In Verification in 3.13.x Feb 28, 2024
@github-project-automation github-project-automation bot moved this from In Verification to Done in 3.13.x Mar 18, 2024