[Bug]: FAILED_TO_START_DUE_TO_NO_CAPACITY #1994

Closed
judeleonard opened this issue Nov 14, 2024 · 26 comments
Labels
bug Something isn't working

Comments

@judeleonard

judeleonard commented Nov 14, 2024

Steps to reproduce

Create a fleet.dstack.yml to provision a remote backend from my on-prem server. The file was created successfully.

dstack apply -f fleet.dstack.yml

type: fleet
name: model-dev-fleet

placement: any

# The user, private SSH key, and hostnames of the on-prem servers
ssh_config:
  user: my_user
  identity_file: ~/.ssh/id_rsa
  hosts:
    - 33.33.48.1

Create another YAML config to provision a dev environment with the provisioned fleet as the backend.

dstack apply -f dev_environment.yml

type: dev-environment
name: model-dev-env

#python: "3.11"  
image: dstackai/base:py3.13-0.6-cuda-12.1

ide: vscode

spot_policy: auto

Actual behaviour

I got the error below:

All provisioning attempts failed. This is likely due to cloud providers not having enough capacity. Check CLI and server logs for more details.

I then tried to get more details about the error with the command below:

dstack ps --verbose

Output:

NAME           BACKEND  REGION  INSTANCE  RESOURCES  SPOT  PRICE  STATUS  SUBMITTED   ERROR                                
 model-dev-env                                                     failed  54 sec ago  JOB_FAILED                           
                                                                                       (FAILED_TO_START_DUE_TO_NO_CAPACITY) 

Expected behaviour

Instance provisioning should be completed successfully with a vscode link to my workspace.

dstack version

0.18.22

Server logs

[12:45:55] INFO     dstack._internal.server.services.backends:404 Requesting instance offers from backends: []                                                                                              
[12:45:56] INFO     dstack._internal.server.background.tasks.process_runs:330 run(110058)model-dev-env: run status has changed SUBMITTED -> TERMINATING                                                     
           INFO     dstack._internal.server.services.jobs:283 job(0d49c8)model-dev-env-0-0: job status is FAILED, reason: FAILED_TO_START_DUE_TO_NO_CAPACITY                                                
[12:45:58] INFO     dstack._internal.server.services.runs:739 run(110058)model-dev-env: run status has changed TERMINATING -> FAILED, reason: JOB_FAILED

Additional information

No response

@judeleonard added the bug label Nov 14, 2024
@peterschmidt85
Contributor

@judeleonard please provide dstack fleet list output.

One possible reason is that the fleet instance has GPUs while the dev environment doesn't request any.

@judeleonard
Author

Here is the output

 FLEET            INSTANCE  BACKEND       RESOURCES                   PRICE  STATUS      CREATED     
 model-dev-fleet  0         ssh (remote)  2xCPU, 0GB, 100.0GB (disk)  $0.0   terminated  3 hours ago 

@judeleonard
Author

@judeleonard please provide dstack fleet list output.

One possible reason is that the fleet instance has GPUs while the dev environment doesn't request any.

Yes, I tried attaching a GPU before, but I got the same "not having enough capacity" error. My remote server actually has both a GPU and Docker preinstalled.

@peterschmidt85
Contributor

peterschmidt85 commented Nov 14, 2024

Here is the output

It's dstack ps, not dstack fleet list

By default, dstack offers only instances that exactly match the resources of the fleet.

@judeleonard
Author

This is the output. Not many details.

NAME           BACKEND  REGION  RESOURCES  SPOT  PRICE  STATUS  SUBMITTED   
 model-dev-env                                           failed  31 mins ago 

@peterschmidt85
Contributor

That means the fleet creation wasn't successful.

  1. Please try again to create the fleet, and post here the entire output.
  2. After that, also please post here the output of dstack server

This will help us understand why the fleet couldn't be created.

@judeleonard
Author

[Screenshot: dstack-server]

This is the dstack web server after the fleet was created. I will try again and post the entire output.

@judeleonard
Author

I just recreated the fleet now


 dstack apply -f fleet.dstack.yml 

/usr/lib/python3/dist-packages/paramiko/transport.py:237: CryptographyDeprecationWarning: Blowfish has been deprecated
  "class": algorithms.Blowfish,
 Project        main             
 User           admin            
 Configuration  fleet.dstack.yml 
 Type           fleet            
 Fleet type     ssh              
 Nodes          1                
 Placement      any              

Found fleet model-dev-fleet. Configuration changes detected.
Re-create the fleet? [y/n]: y

 FLEET            INSTANCE  BACKEND       RESOURCES  PRICE  STATUS   CREATED     ERROR 
 model-dev-fleet  0         ssh (remote)             $0.0   pending  20 sec ago        

Then my dstack server log.

Could this fingerprint error be an issue with my SSH user?

                 'RSAKey' object has no attribute 'fingerprint'                                                                                                                                          
[16:00:13] INFO     dstack._internal.server.background.tasks.process_instances:217 Adding ssh instance model-dev-fleet-0...                                                                                 
[16:00:14] WARNING  dstack._internal.server.background.tasks.process_instances:281 Provisioning instance model-dev-fleet-0 could not be completed because of the error: Deploy instance raised an error:    
                   'RSAKey' object has no attribute 'fingerprint'                                                                                                                                          
[16:00:19] INFO     dstack._internal.server.background.tasks.process_instances:217 Adding ssh instance model-dev-fleet-0...                                                                                 
          WARNING  dstack._internal.server.background.tasks.process_instances:281 Provisioning instance model-dev-fleet-0 could not be completed because of the error: Deploy instance raised an error:    
                   'RSAKey' object has no attribute 'fingerprint'                                                                                                                                          
[16:00:24] INFO     dstack._internal.server.background.tasks.process_instances:217 Adding ssh instance model-dev-fleet-0...                                                                                 
[16:00:25] WARNING  dstack._internal.server.background.tasks.process_instances:281 Provisioning instance model-dev-fleet-0 could not be completed because of the error: Deploy instance raised an error:    
                   'RSAKey' object has no attribute 'fingerprint'                                                                                                                                          
[16:00:29] INFO     dstack._internal.server.background.tasks.process_instances:217 Adding ssh instance model-dev-fleet-0...                                                                                 
          WARNING  dstack._internal.server.background.tasks.process_instances:281 Provisioning instance model-dev-fleet-0 could not be completed because of the error: Deploy instance raised an error:    
                   'RSAKey' object has no attribute 'fingerprint'                                                                                                                                          
[16:00:34] INFO     dstack._internal.server.background.tasks.process_instances:217 Adding ssh instance model-dev-fleet-0...                                                                                 
[16:00:35] WARNING  dstack._internal.server.background.tasks.process_instances:281 Provisioning instance model-dev-fleet-0 could not be completed because of the error: Deploy instance raised an error:    
                   'RSAKey' object has no attribute 'fingerprint'                                                                                                                                          
[16:00:39] INFO     dstack._internal.server.background.tasks.process_instances:217 Adding ssh instance model-dev-fleet-0...                                                                                 
          WARNING  dstack._internal.server.background.tasks.process_instances:281 Provisioning instance model-dev-fleet-0 could not be completed because of the error: Deploy instance raised an error:    
                   'RSAKey' object has no attribute 'fingerprint'                                                                                                                                          
[16:00:43] INFO     dstack._internal.server.background.tasks.process_instances:217 Adding ssh instance model-dev-fleet-0...                                                                                 
          WARNING  dstack._internal.server.background.tasks.process_instances:281 Provisioning instance model-dev-fleet-0 could not be completed because of the error: Deploy instance raised an error:    
                   'RSAKey' object has no attribute 'fingerprint'                                                                                                                                          
[16:00:48] INFO     dstack._internal.server.background.tasks.process_instances:217 Adding ssh instance model-dev-fleet-0...                                                                                 
          WARNING  dstack._internal.server.background.tasks.process_instances:281 Provisioning instance model-dev-fleet-0 could not be completed because of the error: Deploy instance raised an error:    
                   'RSAKey' object has no attribute 'fingerprint'                                                                                                                                          
[16:00:52] INFO     dstack._internal.server.background.tasks.process_instances:217 Adding ssh instance model-dev-fleet-0...                                                                                 
          WARNING  dstack._internal.server.background.tasks.process_instances:281 Provisioning instance model-dev-fleet-0 could not be completed because of the error: Deploy instance raised an error:    
                   'RSAKey' object has no attribute 'fingerprint'                                                                                                                                          
[16:00:57] INFO     dstack._internal.server.background.tasks.process_instances:217 Adding ssh instance model-dev-fleet-0...                                                                                 
          WARNING  dstack._internal.server.background.tasks.process_instances:281 Provisioning instance model-dev-fleet-0 could not be completed because of the error: Deploy instance raised an error:    
                   'RSAKey' object has no attribute 'fingerprint'                                                                                                                                          
[16:01:03] INFO     dstack._internal.server.background.tasks.process_instances:217 Adding ssh instance model-dev-fleet-0...                                                                                 
          WARNING  dstack._internal.server.background.tasks.process_instances:281 Provisioning instance model-dev-fleet-0 could not be completed because of the error: Deploy instance raised an error:    
                   'RSAKey' object has no attribute 'fingerprint'                                                                                                                                          
[16:01:08] INFO     dstack._internal.server.background.tasks.process_instances:217 Adding ssh instance model-dev-fleet-0...                                                                                 
          WARNING  dstack._internal.server.background.tasks.process_instances:281 Provisioning instance model-dev-fleet-0 could not be completed because of the error: Deploy instance raised an error:    
                   'RSAKey' object has no attribute 'fingerprint'                                                                                                                                          
[16:01:13] INFO     dstack._internal.server.background.tasks.process_instances:217 Adding ssh instance model-dev-fleet-0...                                                                                 
          WARNING  dstack._internal.server.background.tasks.process_instances:281 Provisioning instance model-dev-fleet-0 could not be completed because of the error: Deploy instance raised an error:    
                   'RSAKey' object has no attribute 'fingerprint'                                                                                  

@peterschmidt85
Contributor

Oh, that's a known issue. It will be fixed in the next release, but for now please run pip install paramiko -U, then restart the server and try again. The issue should be gone.
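For anyone hitting the same error, a quick way to confirm whether the upgrade took effect is to check that the installed paramiko exposes a `fingerprint` attribute on `RSAKey`. The snippet below is only a sketch: the helper names `supports_fingerprint` and `check_paramiko` are made up for illustration and are not part of dstack or paramiko. The deprecation warning earlier in this thread shows paramiko being loaded from /usr/lib/python3/dist-packages, so it is worth running this with the same Python interpreter that runs dstack server, since a system-wide copy may differ from the one pip upgrades.

```python
import importlib


def supports_fingerprint(key_cls) -> bool:
    """Return True if the given key class exposes a `fingerprint` attribute."""
    return hasattr(key_cls, "fingerprint")


def check_paramiko() -> str:
    """Describe whether the paramiko visible to this interpreter has RSAKey.fingerprint."""
    try:
        paramiko = importlib.import_module("paramiko")
    except ImportError:
        return "paramiko is not installed in this environment"
    version = getattr(paramiko, "__version__", "unknown")
    location = getattr(paramiko, "__file__", "unknown location")
    if supports_fingerprint(paramiko.RSAKey):
        return f"paramiko {version} ({location}): RSAKey.fingerprint is available"
    return (
        f"paramiko {version} ({location}): RSAKey.fingerprint is missing; "
        "try `pip install paramiko -U` and restart the server"
    )


if __name__ == "__main__":
    print(check_paramiko())
```

If the reported location is not the environment where dstack is installed, the upgrade may have gone to a different interpreter than the one the server uses.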

@judeleonard
Author

Thank you. Will do that

@peterschmidt85 closed this as not planned Nov 14, 2024
@judeleonard
Author

Hi @peterschmidt85, sorry for the late update. Our remote server was undergoing some updates.

I tried again after upgrading paramiko as you suggested, and the dstack server logs did change from what I had before. This is my log now. I also updated dstack to the latest version, v0.18.25.

[Screenshot: rename3]

@peterschmidt85
Contributor

@judeleonard Now that you've updated paramiko, please show the full output of creating the fleet (both dstack apply and dstack server outputs).

@judeleonard
Author

dstack apply -f fleet.dstack.yml


 Project        main             
 User           admin            
 Configuration  fleet.dstack.yml 
 Type           fleet            
 Fleet type     ssh              
 Nodes          1                
 Placement      any              

Found fleet model-dev-fleet. Configuration changes detected.
Re-create the fleet? [y/n]: y

 FLEET            INSTANCE  BACKEND       RESOURCES  PRICE  STATUS   CREATED     ERROR 
 model-dev-fleet  0         ssh (remote)             $0.0   pending  16 sec ago    

dstack apply -f dev_environment.yml

Project                main                                         
User                   admin                                        
Configuration          dev_environment.yml                          
Type                   dev-environment                              
Resources              2..xCPU, 8GB.., 1xGPU (10GB), 100GB.. (disk) 
Max price              -                                            
Max duration           6h                                           
Spot policy            auto                                         
Retry policy           no                                           
Creation policy        reuse-or-create                              
Termination policy     destroy-after-idle                           
Termination idle time  5m                                           

Finished run model-dev-env already exists.
Override the run? [y/n]: y
model-dev-env provisioning completed (terminating)
All provisioning attempts failed. This is likely due to cloud providers not having enough capacity. Check CLI and server logs for more details.

@peterschmidt85
Contributor

@judeleonard

 FLEET            INSTANCE  BACKEND       RESOURCES  PRICE  STATUS   CREATED     ERROR 
 model-dev-fleet  0         ssh (remote)             $0.0   pending  16 sec ago        

But what did it show after that? Was it successful?

Running anything before the fleet is created doesn't make sense.

Let's try to understand why the fleet isn't being created. We need the logs for that.
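Since the server retries provisioning every few seconds, the logs in this thread repeat the same message many times. A rough sketch of how one might collapse such a log into distinct messages with counts before posting it is shown below. It assumes the line layout seen in this thread (an optional `[HH:MM:SS]` stamp, a level, then the message, with wrapped continuation lines that do not start with a level); `summarize` is a hypothetical helper, not a dstack command.

```python
import re
from collections import Counter

# A record starts with an optional [HH:MM:SS] stamp followed by a log level.
# Anything else is treated as a wrapped continuation of the previous record.
RECORD_START = re.compile(
    r"^\s*(?:\[\d{2}:\d{2}:\d{2}\]\s+)?(DEBUG|INFO|WARNING|ERROR)\s+(.*)$"
)


def summarize(log_text: str) -> Counter:
    """Count distinct (level, message) pairs, joining wrapped continuation lines."""
    records = []
    for raw in log_text.splitlines():
        line = raw.rstrip()
        if not line.strip():
            continue  # skip blank lines
        match = RECORD_START.match(line)
        if match:
            records.append([match.group(1), match.group(2).strip()])
        elif records:
            # Continuation line: append it to the message of the last record.
            records[-1][1] += " " + line.strip()
    return Counter((level, message) for level, message in records)


if __name__ == "__main__":
    import sys

    # Usage (assumed): pipe or redirect a saved server log into this script.
    for (level, message), count in summarize(sys.stdin.read()).most_common():
        print(f"{count:4d}x {level:8s} {message}")
```

A continuation line that happens to begin with a level word would be misread as a new record, so treat the counts as a rough summary rather than an exact parse.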

@judeleonard
Author

 dstack._internal.server.background.tasks.process_instances:217 Adding ssh instance model-dev-fleet-0...                                                                                 
           WARNING  dstack._internal.server.background.tasks.process_instances:227 Failed to start instance model-dev-fleet-0 in 600 seconds. Terminating...                                                
[11:50:02] INFO     dstack._internal.server.services.fleets:363 Deleting fleets: ['model-dev-fleet']                                                                                                        
[11:50:09] INFO     dstack._internal.server.background.tasks.process_fleets:72 Automatic cleanup of an empty fleet model-dev-fleet                                                                          
           INFO     dstack._internal.server.background.tasks.process_fleets:78 Fleet model-dev-fleet deleted                                                                                                
[11:50:11] INFO     dstack._internal.server.background.tasks.process_instances:217 Adding ssh instance model-dev-fleet-0...                                                                                 
           WARNING  dstack._internal.server.background.tasks.process_instances:281 Provisioning instance model-dev-fleet-0 could not be completed because of the error: Deploy instance raised an error: SSH
                    connection to the jude@1.1.4.1:22 with keys ['SHA256:T59TCqbDm+dzO/riBFx8B321nQ3v0rEhwYqXJBM'] was unsuccessful                                                                  
[11:50:16] INFO     dstack._internal.server.background.tasks.process_instances:217 Adding ssh instance model-dev-fleet-0...                                                                                 
           WARNING  dstack._internal.server.background.tasks.process_instances:281 Provisioning instance model-dev-fleet-0 could not be completed because of the error: Deploy instance raised an error: SSH
                    connection to the jude@1.1.4.1:22 with keys ['SHA256:T59TCqbDm+dzO/rigiBF21nQ3v0rEhwYqXJBM'] was unsuccessful                                                                  
[11:50:22] INFO     dstack._internal.server.background.tasks.process_instances:217 Adding ssh instance model-dev-fleet-0...                                                                                 
           WARNING  dstack._internal.server.background.tasks.process_instances:281 Provisioning instance model-dev-fleet-0 could not be completed because of the error: Deploy instance raised an error: SSH
                    connection to the jude@1.1.4.1:22 with keys ['SHA256:T59TCqbDm+dzO/rigiBFGtxQ3v0rEhwYqXJBM'] was unsuccessful                                                                  
[11:50:27] INFO     dstack._internal.server.background.tasks.process_instances:217 Adding ssh instance model-dev-fleet-0...                                                                                 
           WARNING  dstack._internal.server.background.tasks.process_instances:281 Provisioning instance model-dev-fleet-0 could not be completed because of the error: Deploy instance raised an error: SSH
                    connection to the jude@1.1.4.1:22 with keys ['SHA256:T59TCqbDm+dzO/rigiBFGtx8B321nQ3v0rEYqXJBM'] was unsuccessful                                                                  
[11:50:31] INFO     dstack._internal.server.background.tasks.process_instances:217 Adding ssh instance model-dev-fleet-0...                                                                                 
[11:50:32] WARNING  dstack._internal.server.background.tasks.process_instances:281 Provisioning instance model-dev-fleet-0 could not be completed because of the error: Deploy instance raised an error: SSH
                    connection to the jude@1.1.4.1:22 with keys ['SHA256:T59TCqbDm+dzO/rigiBFGtx8B321nQ3v0XJBM'] was unsuccessful                                                                  
[11:50:37] INFO     dstack._internal.server.background.tasks.process_instances:217 Adding ssh instance model-dev-fleet-0...                                                                                 
           WARNING  dstack._internal.server.background.tasks.process_instances:281 Provisioning instance model-dev-fleet-0 could not be completed because of the error: Deploy instance raised an error: SSH
                    connection to the jude@1.1.4.1:22 with keys ['SHA256:T59TCqbDm+dzO/rigiBFGtx8B321wYqXJBM'] was unsuccessful                                                                  
[11:50:42] INFO     dstack._internal.server.background.tasks.process_instances:217 Adding ssh instance model-dev-fleet-0...                                                                                 
[11:50:43] WARNING  dstack._internal.server.background.tasks.process_instances:281 Provisioning instance model-dev-fleet-0 could not be completed because of the error: Deploy instance raised an error: SSH
                    connection to the jude@1.1.4.1:22 with keys ['SHA256:T59TCqbDm+dzO/rigiBFGtx8B321wYqXJBM'] was unsuccessful                                                                  
[11:50:48] INFO     dstack._internal.server.background.tasks.process_instances:217 Adding ssh instance model-dev-fleet-0...                                                                                 
           WARNING  dstack._internal.server.background.tasks.process_instances:281 Provisioning instance model-dev-fleet-0 could not be completed because of the error: Deploy instance raised an error: SSH
                    connection to the jude@1.1.4.1:22 with keys ['SHA256:T59TCqbDm+dzO/rigiBFGtx8B321nQ3vXJBM'] was unsuccessful                                                                  
[11:50:50] INFO     dstack._internal.server.services.backends:404 Requesting instance offers from backends: []                                                                                              
[11:50:54] INFO     dstack._internal.server.background.tasks.process_instances:217 Adding ssh instance model-dev-fleet-0...                                                                                 
           WARNING  dstack._internal.server.background.tasks.process_instances:281 Provisioning instance model-dev-fleet-0 could not be completed because of the error: Deploy instance raised an error: SSH
                    connection to the jude@1.1.4.1:22 with keys ['SHA256:T59TCqbDm+dzO/rigiBFGtx8B321nQYqXJBM'] was unsuccessful                                                                  
[11:50:58] INFO     dstack._internal.server.background.tasks.process_instances:217 Adding ssh instance model-dev-fleet-0...                                                                                 
           WARNING  dstack._internal.server.background.tasks.process_instances:281 Provisioning instance model-dev-fleet-0 could not be completed because of the error: Deploy instance raised an error: SSH
                    connection to the jude@1.1.4.1:22 with keys ['SHA256:T59TCqbDm+dzO/rigiBFGtx8BwYqXJBM'] was unsuccessful                                                                  
[11:51:04] INFO     dstack._internal.server.background.tasks.process_instances:217 Adding ssh instance model-dev-fleet-0...                                                                                 
           WARNING  dstack._internal.server.background.tasks.process_instances:281 Provisioning instance model-dev-fleet-0 could not be completed because of the error: Deploy instance raised an error: SSH
                    connection to the jude@1.1.4.1:22 with keys ['SHA256:T59TCqbDm+dzO/rigiBFGtx8B321nQ3v0rEBM'] was unsuccessful                                                                  
[11:51:10] INFO     dstack._internal.server.background.tasks.process_instances:217 Adding ssh instance model-dev-fleet-0...                                                                                 
           WARNING  dstack._internal.server.background.tasks.process_instances:281 Provisioning instance model-dev-fleet-0 could not be completed because of the error: Deploy instance raised an error: SSH
                    connection to the jude@1.1.4.1:22 with keys ['SHA256:T59TCqbDm+dzO/rigiBFGtx8B321nQ3v0rJBM'] was unsuccessful                                                                  
[11:51:12] INFO     dstack._internal.server.services.backends:404 Requesting instance offers from backends: []                                                                                              
           INFO     dstack._internal.server.background.tasks.process_runs:330 run(d0951d)model-dev-env: run status has changed SUBMITTED -> TERMINATING                                                     
[11:51:14] INFO     dstack._internal.server.services.jobs:283 job(4f4d88)model-dev-env-0-0: job status is FAILED, reason: FAILED_TO_START_DUE_TO_NO_CAPACITY                                                
[11:51:15] INFO     dstack._internal.server.background.tasks.process_instances:217 Adding ssh instance model-dev-fleet-0...                                                                                 
           WARNING  dstack._internal.server.background.tasks.process_instances:281 Provisioning instance model-dev-fleet-0 could not be completed because of the error: Deploy instance raised an error: SSH
                    connection to the jude@1.1.4.1:22 with keys ['SHA256:T59TCqbDm+dzO/rigiBFGtx8B321nQ3v0qXJBM'] was unsuccessful                                                                  
[11:51:16] INFO     dstack._internal.server.services.runs:739 run(d0951d)model-dev-env: run status has changed TERMINATING -> FAILED, reason: JOB_FAILED                                                    
[11:51:21] INFO     dstack._internal.server.background.tasks.process_instances:217 Adding ssh instance model-dev-fleet-0...                                                                                 
           WARNING  dstack._internal.server.background.tasks.process_instances:281 Provisioning instance model-dev-fleet-0 could not be completed because of the error: Deploy instance raised an error: SSH
                    connection to the jude@1.1.4.1:22 with keys ['SHA256:T59TCqbDm+dzO/rigiBFGtx8B321nQ3XJBM'] was unsuccessful                                                                  
[11:51:26] INFO     dstack._internal.server.background.tasks.process_instances:217 Adding ssh instance model-dev-fleet-0...                                                                                 
           WARNING  dstack._internal.server.background.tasks.process_instances:281 Provisioning instance model-dev-fleet-0 could not be completed because of the error: Deploy instance raised an error: SSH
                    connection to the jude@1.1.4.1:22 with keys ['SHA256:T59TCqbDm+dzO/rigiBFGtx8B321nhwYqXJBM'] was unsuccessful                                                                                                                                                           
[11:55:31] INFO     dstack._internal.server.background.tasks.process_instances:217 Adding ssh instance model-dev-fleet-0...                                                                                 
           WARNING  dstack._internal.server.background.tasks.process_instances:281 Provisioning instance model-dev-fleet-0 could not be completed because of the error: Deploy instance raised an error: SSH
                    connection to the jude@1.1.4.1:22 with keys ['SHA256:T59TCqbDm+dzO/rigiBFGtx8B3YqXJBM'] was unsuccessful                                                                  
[11:55:36] INFO     dstack._internal.server.background.tasks.process_instances:217 Adding ssh instance model-dev-fleet-0...                                                                                 
[11:55:37] WARNING  dstack._internal.server.background.tasks.process_instances:281 Provisioning instance model-dev-fleet-0 could not be completed because of the error: Deploy instance raised an error: SSH
                    connection to the jude@1.1.4.1:22 with keys ['SHA256:T59TCqbDm+dzO/rigiBFGtx8B321XJBM'] was unsuccessful                                                                  
[11:55:41] INFO     dstack._internal.server.background.tasks.process_instances:217 Adding ssh instance model-dev-fleet-0...                                                                                 
           WARNING  dstack._internal.server.background.tasks.process_instances:281 Provisioning instance model-dev-fleet-0 could not be completed because of the error: Deploy instance raised an error: SSH
                    connection to the jude@1.1.4.1:22 with keys ['SHA256:T59TCqbDm+dzO/rigiBFGv0rEhwYqXJBM'] was unsuccessful                                                                  
[11:55:45] INFO     dstack._internal.server.background.tasks.process_instances:217 Adding ssh instance model-dev-fleet-0...                                                                                 
           WARNING  dstack._internal.server.background.tasks.process_instances:281 Provisioning instance model-dev-fleet-0 could not be completed because of the error: Deploy instance raised an error: SSH
                    connection to the jude@1.1.4.1:22 with keys ['SHA256:T59TCqbDm+dzO/rigiBFGthwYqXJBM'] was unsuccessful                                                                  
[11:55:50] INFO     dstack._internal.server.background.tasks.process_instances:217 Adding ssh instance model-dev-fleet-0...                                                                                 
           WARNING  dstack._internal.server.background.tasks.process_instances:281 Provisioning instance model-dev-fleet-0 could not be completed because of the error: Deploy instance raised an error: SSH
                    connection to the jude@1.1.4.1:22 with keys ['SHA256:T59TCqbDm+dzO/rigiBFGtx8B32YqXJBM'] was unsuccessful                                                                  
[11:55:55] INFO     dstack._internal.server.background.tasks.process_instances:217 Adding ssh instance model-dev-fleet-0...                                                                                 
[11:55:56] WARNING  dstack._internal.server.background.tasks.process_instances:281 Provisioning instance model-dev-fleet-0 could not be completed because of the error: Deploy instance raised an error: SSH
                    connection to the jude@1.1.4.1:22 with keys ['SHA256:T59TCqbDm+dzO/rigiBFGtx8qXJBM'] was unsuccessful                                                                  
[11:56:00] INFO     dstack._internal.server.background.tasks.process_instances:217 Adding ssh instance model-dev-fleet-0...                                                                                 
           WARNING  dstack._internal.server.background.tasks.process_instances:281 Provisioning instance model-dev-fleet-0 could not be completed because of the error: Deploy instance raised an error: SSH
                    connection to the jude@1.1.4.1:22 with keys ['SHA256:T59TCqbDm+dzO/rigiBFGtx8B32qXJBM'] was unsuccessful                                                                  
[11:56:04] INFO     dstack._internal.server.background.tasks.process_instances:217 Adding ssh instance model-dev-fleet-0...                                                                                 
           WARNING  dstack._internal.server.background.tasks.process_instances:281 Provisioning instance model-dev-fleet-0 could not be completed because of the error: Deploy instance raised an error: SSH
                    connection to the jude@1.1.4.1:22 with keys ['SHA256:T59TCqbDm+dzO/rigiBFGtxYqXJBM'] was unsuccessful                                                                  
[11:56:09] INFO     dstack._internal.server.background.tasks.process_instances:217 Adding ssh instance model-dev-fleet-0...                                                                                 
           WARNING  dstack._internal.server.background.tasks.process_instances:281 Provisioning instance model-dev-fleet-0 could not be completed because of the error: Deploy instance raised an error: SSH
                    connection to the jude@1.1.4.1:22 with keys ['SHA256:T59TCqbDm+dzO/rigiBFGtx8B321wYqXJBM'] was unsuccessful                                                                  
[11:56:14] INFO     dstack._internal.server.background.tasks.process_instances:217 Adding ssh instance model-dev-fleet-0...                                                                                 
[11:56:15] WARNING  dstack._internal.server.background.tasks.process_instances:281 Provisioning instance model-dev-fleet-0 could not be completed because of the error: Deploy instance raised an error: SSH
                    connection to the jude@1.1.4.1:22 with keys ['SHA256:T59TCqbDm+dzO/rigiwYqXJBM'] was unsuccessful                                                                  
[11:56:20] INFO     dstack._internal.server.background.tasks.process_instances:217 Adding ssh instance model-dev-fleet-0...                                                                                 
[11:56:21] WARNING  dstack._internal.server.background.tasks.process_instances:281 Provisioning instance model-dev-fleet-0 could not be completed because of the error: Deploy instance raised an error: SSH
                    connection to the jude@1.1.4.1:22 with keys ['SHA256:T59TCqbDm+dzO/rigiBFGtx8qXJBM'] was unsuccessful                                                                  
[11:56:26] INFO     dstack._internal.server.background.tasks.process_instances:217 Adding ssh instance model-dev-fleet-0...                                                                                 
           WARNING  dstack._internal.server.background.tasks.process_instances:281 Provisioning instance model-dev-fleet-0 could not be completed because of the error: Deploy instance raised an error: SSH
                    connection to the jude@1.1.4.1:22 with keys ['SHA256:T59TCqbDm+dzO/rigiBFGtv0rEhwYqXJBM'] was unsuccessful                                                                  


@peterschmidt85
Contributor

peterschmidt85 commented Nov 15, 2024

This clearly shows that dstack cannot connect to the instance using the provided key

@judeleonard Can you connect to the same host using the provided key via ssh -i <key path> jude@1.1.4.1?

@un-def Any ideas what could be wrong?

@judeleonard
Author

Yes, I can connect to the same server via ssh from my terminal.

@judeleonard
Author

judeleonard commented Nov 15, 2024

> This clearly shows that dstack cannot connect to the instance using the provided key
>
> @judeleonard Can you connect to the same host using the provided key via ssh -i <key path> jude@1.1.4.1?
>
> @un-def Any ideas what could be wrong?

The current user I am using actually requires a password to successfully connect to the server. Could this be why?

@peterschmidt85
Contributor

@judeleonard Yes! This certainly can be a reason.

@judeleonard
Author

Okay, let me work on this and try it again.

@peterschmidt85
Contributor

peterschmidt85 commented Nov 15, 2024

@judeleonard Also, please make sure the public SSH key is added to ~/.ssh/authorized_keys on the host.

@peterschmidt85
Contributor

Basically, dstack works only if ssh works without a password.
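
One way to confirm this from the machine running the dstack server is to attempt a key-only, non-interactive login. The sketch below is not part of dstack: `build_ssh_check_command` and `can_login_without_password` are hypothetical helper names, and it assumes the OpenSSH client is on PATH. `BatchMode=yes` makes ssh fail instead of prompting for a password or key passphrase.

```python
import os
import subprocess


def build_ssh_check_command(user, host, identity_file, port=22, timeout=10):
    """Build an ssh command that fails instead of prompting for input."""
    return [
        "ssh",
        "-i", os.path.expanduser(identity_file),
        "-p", str(port),
        "-o", "BatchMode=yes",        # never prompt for a password or passphrase
        "-o", "IdentitiesOnly=yes",   # offer only the key given with -i
        "-o", f"ConnectTimeout={timeout}",
        f"{user}@{host}",
        "true",                       # run a no-op command on the host and exit
    ]


def can_login_without_password(user, host, identity_file, port=22):
    """Return True if a key-only, non-interactive SSH login succeeds."""
    cmd = build_ssh_check_command(user, host, identity_file, port)
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0
```

For the fleet in this issue the call would look like `can_login_without_password("jude", "1.1.4.1", "~/.ssh/id_rsa")`. A `False` result can also mean that the host key has not been accepted yet or that the host is unreachable, so it is worth reading ssh's own error output before concluding that the key is the problem.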

@Fake45mar

Fake45mar commented Dec 3, 2024

Hello!
I have an issue with dstack. To avoid opening another issue with a similar title, I am posting it here.
I am trying to run a dstack server and an SSH fleet locally and test them with, say, a VS Code dev environment.
My SSH fleet config (.dstack.yaml):

type: fleet
# The name is optional, if not specified, generated randomly
name: fleet1

# Ensure instances are interconnected
placement: cluster

# The user, private SSH key, and hostnames of the on-prem servers
ssh_config:
  user: "user"
  identity_file: /home/user/.ssh/dstack
  hosts:
    - 127.0.0.1
  port: 2261

resources:
  cpu: 7
  memory: 10GB
  disk: 15GB
termination_idle_time: 2h

#fleet-ssh.dstack.yml

After applying it with "dstack apply -f .dstack.yaml", I get the following:

 Project        main         
 User           admin        
 Configuration  .dstack.yaml 
 Type           fleet        
 Fleet type     ssh          
 Nodes          1            
 Placement      cluster      

Fleet fleet1 does not exist yet.
Create the fleet? [y/n]: y

 FLEET   INSTANCE  BACKEND       RESOURCES                   PRICE  STATUS  CREATED    ERROR 
 fleet1  0         ssh (remote)  8xCPU, 39GB, 35.9GB (disk)  $0.0   idle    1 min ago

A minor issue: it completely ignored the declared resource request (7 CPUs, 10GB RAM, 15GB disk), but that is not my main concern yet.

The fleet is created, sudo is passwordless, and Docker is installed: "Docker version 27.3.1, build ce12230".
My laptop has no dedicated GPU, but the integrated one is available and detected, according to this command:

sudo lshw -C display
 *-display                 
       description: VGA compatible controller
       product: TigerLake-LP GT2 [Iris Xe Graphics]
       vendor: Intel Corporation
       physical id: 2
       bus info: pci@0000:00:02.0
       logical name: /dev/fb0
       version: 01
       width: 64 bits
       clock: 33MHz
       capabilities: pciexpress msi pm vga_controller bus_master cap_list rom fb
       configuration: depth=32 driver=i915 latency=0 mode=1920x1080 resolution=1920,1080 visual=truecolor xres=1920 yres=1080
       resources: iomemory:600-5ff iomemory:400-3ff irq:171 memory:601c000000-601cffffff memory:4000000000-400fffffff ioport:4000(size=64) memory:c0000-dffff memory:4010000000-4016ffffff memory:4020000000-40ffffffff

So it seems doable.

Then I ran one of the examples I found in your docs:

# The name is optional, if not specified, generated randomly
name: vscode

python: "3.11"
# Uncomment to use a custom Docker image
#image: dstackai/base:py3.13-0.6-cuda-12.1

ide: vscode

# Use either spot or on-demand instances
#spot_policy: on-demand

# Uncomment to request resources
resources:
  cpu: 2
  memory: 2GB
  disk: 5GB
  gpu: 0

#.dstack.yml

This configuration does pick up the resource request:
 Project                main                          
 User                   admin                         
 Configuration          test.dstack.yml               
 Type                   dev-environment               
 Resources              2xCPU, 2GB, 0xGPU, 5GB (disk) 
 Max price              -                             
 Max duration           6h                            
 Spot policy            on-demand                     
 Retry policy           no                            
 Creation policy        reuse                         
 Termination policy     destroy-after-idle            
 Termination idle time  5m                            

Finished run vscode already exists.
Override the run? [y/n]: y
vscode provisioning completed (failed)
All provisioning attempts failed. This is likely due to cloud providers not having enough capacity. Check 
CLI and server logs for more details.

Unfortunately, the run ends up failed with FAILED_TO_START_DUE_TO_NO_CAPACITY.
Here is the log from dstack server -y:
[15:44:20] INFO     Applying ~/.dstack/server/config.yml...                                                
           INFO     The admin token is 3a2a0bdd-e50e-4b6d-bcae-58497013dfd7                                
           INFO     The dstack server 0.18.28 is running at http://127.0.0.1:3000                          
[15:44:29] INFO     dstack._internal.server.services.fleets:368 Deleting fleets: ['fleet1']                
[15:44:30] INFO     dstack._internal.server.background.tasks.process_instances:780 Instance fleet1-0       
                    terminated                                                                             
[15:44:35] INFO     dstack._internal.server.background.tasks.process_fleets:72 Automatic cleanup of an     
                    empty fleet fleet1                                                                     
           INFO     dstack._internal.server.background.tasks.process_fleets:78 Fleet fleet1 deleted        
[15:44:39] INFO     dstack._internal.server.background.tasks.process_instances:216 Adding ssh instance     
                    fleet1-0...                                                                            
[15:44:43] INFO     dstack._internal.server.background.tasks.process_instances:356 Connected to            
                    lev.sliedniev 127.0.0.1                                                                
[15:45:29] INFO     dstack._internal.server.background.tasks.process_instances:274 The instance fleet1-0   
                    (127.0.0.1) was successfully added                                                     
[15:45:54] INFO     dstack._internal.server.background.tasks.process_runs:330 run(9a8b71)vscode: run status
                    has changed SUBMITTED -> TERMINATING                                                   
[15:45:57] INFO     dstack._internal.server.services.jobs:283 job(7ef866)vscode-0-0: job status is FAILED, 
                    reason: FAILED_TO_START_DUE_TO_NO_CAPACITY                                             
[15:45:58] INFO     dstack._internal.server.services.runs:952 run(9a8b71)vscode: run status has changed    
                    TERMINATING -> FAILED, reason: JOB_FAILED

Can you help me? I would be very grateful!

shim log:
2024/12/03 15:45:26 Downloading runner from https://dstack-runner-downloads.s3.eu-west-1.amazonaws.com/0.18.28/binaries/dstack-runner-linux-amd64
2024/12/03 15:45:28 The runner was downloaded successfully (18773619 bytes)
2024/12/03 15:45:28 Config Shim: {HTTPPort:10998 HomeDir:/root/.dstack}
2024/12/03 15:45:28 Config Runner: {HTTPPort:10999 LogLevel:6 DownloadURL:https://dstack-runner-downloads.s3.eu-west-1.amazonaws.com/0.18.28/binaries/dstack-runner-linux-amd64 BinaryPath:/tmp/dstack-runner2024387650 TempDir:/tmp/runner HomeDir:/root WorkingDir:/workflow}
2024/12/03 15:45:28 Config Docker: {SSHPort:10022 ConcatinatedPublicSSHKeys:ssh-rsa ~/.ssh/dstack
ssh-rsa:}
2024/12/03 15:54:00 Downloading runner from https://dstack-runner-downloads.s3.eu-west-1.amazonaws.com/0.18.28/binaries/dstack-runner-linux-amd64
2024/12/03 15:54:03 The runner was downloaded successfully (18773619 bytes)
2024/12/03 15:54:03 Config Shim: {HTTPPort:10998 HomeDir:/root/.dstack}
2024/12/03 15:54:03 Config Runner: {HTTPPort:10999 LogLevel:6 DownloadURL:https://dstack-runner-downloads.s3.eu-west-1.amazonaws.com/0.18.28/binaries/dstack-runner-linux-amd64 BinaryPath:/tmp/dstack-runner1393258449 TempDir:/tmp/runner HomeDir:/root WorkingDir:/workflow}
2024/12/03 15:54:03 Config Docker: {SSHPort:10022 ConcatinatedPublicSSHKeys:ssh-rsa key in ~/.ssh/dstack
ssh-rsa:}

I removed the keys from the shim log.

Thank you!

@jvstme
Collaborator

jvstme commented Dec 3, 2024

Hi @Fake45mar. Looks like your dev-environment configuration is requesting exactly 2 CPUs, 2 GB RAM, and a 5GB disk.

resources:
  cpu: 2
  memory: 2GB
  disk: 5GB
  gpu: 0

However, the instance in your fleet1 fleet has 8 CPUs, 39 GB RAM, and a 35.9GB disk.

 FLEET   INSTANCE  BACKEND       RESOURCES                   PRICE  STATUS  CREATED    ERROR 
 fleet1  0         ssh (remote)  8xCPU, 39GB, 35.9GB (disk)  $0.0   idle    1 min ago

This instance does not match the resources in your dev-environment configuration. Since dstack can't find any instances that match, the run fails with FAILED_TO_START_DUE_TO_NO_CAPACITY.

Try removing resources from your dev-environment configuration or setting the requirements to open ranges:

resources:
  cpu: 2..
  memory: 2GB..
  disk: 5GB..
  gpu: 0

This notation means "2 CPUs or more", "2 GB or more", etc. You can find more examples of setting resources in the reference.
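
To make the difference between an exact value and an open range concrete, here is a minimal sketch of the matching rule as described above. It is an illustration only, not dstack's actual implementation: `parse_range` and `satisfies` are hypothetical helpers, units such as GB are left out, and the bounded form `low..high` is an assumption of the sketch.

```python
def parse_range(spec):
    """Parse "2", "2.." or "2..8" into an inclusive (low, high) pair.

    None means "unbounded" on that side.
    """
    text = str(spec).strip()
    if ".." not in text:
        value = float(text)
        return value, value  # a bare number is an exact requirement
    low, high = text.split("..", 1)
    return (float(low) if low else None, float(high) if high else None)


def satisfies(spec, available):
    """Return True if the available amount falls inside the requested range."""
    low, high = parse_range(spec)
    if low is not None and available < low:
        return False
    if high is not None and available > high:
        return False
    return True
```

With the instance above, `satisfies("2", 8)` is `False` because 8 CPUs is not exactly 2, while `satisfies("2..", 8)` is `True`.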

@Fake45mar

Fake45mar commented Dec 3, 2024

@jvstme, hey! Thank you for the quick reply. Yes, it no longer fails with the error code I mentioned before. However, it still doesn't start VS Code as expected:

dstack apply -R -f test.dstack.yml
 Project                main                                
 User                   admin                               
 Configuration          test.dstack.yml                     
 Type                   dev-environment                     
 Resources              2..xCPU, 2GB.., 0xGPU, 5GB.. (disk) 
 Max price              -                                   
 Max duration           6h                                  
 Spot policy            on-demand                           
 Retry policy           no                                  
 Creation policy        reuse                               
 Termination policy     destroy-after-idle                  
 Termination idle time  5m                                  

 #  BACKEND  REGION  INSTANCE  RESOURCES                   SPOT  PRICE       
 1  ssh      remote  instance  8xCPU, 39GB, 35.9GB (disk)  no    $0     idle 

Finished run vscode already exists.
Override the run? [y/n]: y
vscode provisioning completed (terminating)
Run failed with error code CONTAINER_EXITED_WITH_ERROR.
Error: time=2024-12-03T10:01:42.878879-05:00 level=trace msg=Starting API server port=10999
time=2024-12-03T10:01:42.879068-05:00 level=error msg=Server failed err=listen tcp :10999: bind: address 
already in use
time=2024-12-03T10:03:42.977802-05:00 level=error msg=Job didn't start in time, shutting down
Server failed:  no job
Check CLI, server, and run logs for more details.

What makes it even stranger is that port 10999 is not actually busy. Should I continue posting my questions here, or is it better to open a new thread?
In any case, thank you.
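
One way to double-check whether the port is in use is to try binding to it on the host where the runner starts (the shim log above shows the runner configured with HTTPPort:10999). This is a minimal sketch, not part of dstack; `port_is_free` is a hypothetical helper, and it only reflects the network namespace it runs in, so a result from the host may not match what a container sees.

```python
import socket


def port_is_free(port, host=""):
    """Return True if a TCP socket can currently bind to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "address already in use", or a permissions error
            return False
    return True
```

Calling `port_is_free(10999)` right after a failed run shows whether something is still holding the port at that moment.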

@jvstme
Collaborator

jvstme commented Dec 3, 2024

@Fake45mar, yes, could you please open another issue for this question or join our Discord server? We can answer your question there. Happy to help!
