Spark failed to start: Driver unresponsive using vanilla Databricks cluster #3582

Closed
tljjvogten opened this issue Jun 20, 2023 · 13 comments · Fixed by #3583

@tljjvogten

Description

In my Azure TRE deployment I am trying to start a cluster in Databricks. I've added all the Databricks services and they are working fine. However, I get the error shown below. Am I missing something?

Spark failed to start: Driver unresponsive. Possible reasons: library conflicts, incorrect metastore configuration, and init script misconfiguration.

Steps

The steps I have tried are:

  1. Vanilla cluster with no init scripts, libraries or custom metastores
  2. Different machine types for the cluster
  3. Checked for quota issues; there were none

tljjvogten added the question label Jun 20, 2023

@marrobi
Member

marrobi commented Jun 20, 2023

Which Azure region are you using?

Can you check the Azure firewall logs for any traffic being denied from the subnet containing Databricks?

Also in the cluster logs, anything of note?

I'll try to test it out at this end, but it might be a couple of days.

@tljjvogten
Author

I'm using region West Europe. I'll dive into the logs. Thank you.

@tljjvogten
Author

Only error in the cluster logs is the one mentioned.

@marrobi
Member

marrobi commented Jun 20, 2023

OK, if you can check the firewall logs. My guess is that a file download is being blocked by the firewall, maybe a new FQDN dependency. It's been a couple of months since I've personally tested this.

If nothing obvious shows up, it might be worth opening a support ticket for Databricks via the Azure portal; they might be able to provide some guidance.

@tljjvogten
Author

I'll check. But progress is slow. Running from one error into another.

@marrobi
Member

marrobi commented Jun 21, 2023

Just spinning up a cluster. Will let you know how I get on.

@tljjvogten
Author

tljjvogten commented Jun 21, 2023 via email

@marrobi
Member

marrobi commented Jun 21, 2023

In the firewall logs I can see:

HTTPS request from 10.1.5.132:39766 to stgdbfssvc4c74.dfs.core.windows.net:443. Action: Deny. No rule matched. Proceeding with default action
HTTPS request from 10.1.5.134:59862 to md-hdd-kjwr2pc2wdgv.z29.blob.storage.azure.net:443. Action: Deny. No rule matched. Proceeding with default action
HTTPS request from 10.1.5.132:38192 to umsarnnfvjl3bbcrdndk.blob.core.windows.net:443. Action: Deny. No rule matched. Proceeding with default action
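
A minimal sketch for pulling the denied destinations out of log lines like the ones above. It assumes the lines follow exactly the format quoted here; the `denied_hosts` helper and the regular expression are illustrative and not part of Azure TRE or any Azure tooling.

```python
import re

# Matches the log line format quoted in this thread (an assumption;
# other Azure Firewall log formats will need a different pattern).
DENY_RE = re.compile(
    r"HTTPS request from (?P<src>[\d.]+):\d+ "
    r"to (?P<host>[^\s:]+):(?P<port>\d+)\. Action: Deny"
)


def denied_hosts(lines):
    """Return the sorted, de-duplicated destination hosts of denied requests."""
    hosts = set()
    for line in lines:
        match = DENY_RE.search(line)
        if match:
            hosts.add(match.group("host"))
    return sorted(hosts)


# The three log lines from this comment.
sample = [
    "HTTPS request from 10.1.5.132:39766 to stgdbfssvc4c74.dfs.core.windows.net:443. "
    "Action: Deny. No rule matched. Proceeding with default action",
    "HTTPS request from 10.1.5.134:59862 to md-hdd-kjwr2pc2wdgv.z29.blob.storage.azure.net:443. "
    "Action: Deny. No rule matched. Proceeding with default action",
    "HTTPS request from 10.1.5.132:38192 to umsarnnfvjl3bbcrdndk.blob.core.windows.net:443. "
    "Action: Deny. No rule matched. Proceeding with default action",
]

for host in denied_hosts(sample):
    print(host)
```

De-duplicating the hosts keeps the list short enough to compare by hand against the firewall rule collection.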

Will check the rule collection versus published rules. We recently added a note here - https://microsoft.github.io/AzureTRE/unreleased/tre-templates/workspace-services/databricks/


This service uses a JSON file to store the various network endpoints required by Databricks to function.

If you hit networking related issues when deploying or using Databricks, please ensure this file https://github.com/microsoft/AzureTRE/blob/main/templates/workspace_services/databricks/terraform/databricks-udr.json contains the appropriate settings for the region you are using.

The required settings for each region can be extracted from this document: https://learn.microsoft.com/azure/databricks/resources/supported-regions.
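
One hedged way to run the check described in the note: walk the parsed JSON and report which required hosts appear nowhere in it. The layout of databricks-udr.json is not assumed here, so the sketch searches every string value; the `example` document, its keys, and both helper names are hypothetical and only illustrate the idea.

```python
import json


def collect_strings(node):
    """Yield every string value found anywhere in a parsed JSON document."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from collect_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from collect_strings(item)


def missing_hosts(document, required):
    """Return the required hosts that match no string value in the document."""
    present = set(collect_strings(document))
    return sorted(host for host in required if host not in present)


# Hypothetical document shape, for illustration only; not the real file's layout.
example = json.loads("""
{"westeurope": {"artifact": ["dbartifactsprodwesteu1.blob.core.windows.net"]}}
""")

required = [
    "arprodwesteua1.blob.core.windows.net",
    "dbartifactsprodwesteu1.blob.core.windows.net",
]
print(missing_hosts(example, required))
```

The match is exact, so a file that stores values such as `host:port` or full URLs would need the strings normalized first.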

@marrobi
Member

marrobi commented Jun 21, 2023

Comparing against the JSON file, I see some differences. These might be recent changes, or errors from when the JSON file was created.

metastore:
consolidated-westeurope-prod-metastore.mysql.database.azure.com has been added

artifact:
dbartifactsprodwesteu1.blob.core.windows.net has gone
arprodwesteua1.blob.core.windows.net has been added

I added them, which leaves stgdbfssvc4c74.dfs.core.windows.net. This should be a private endpoint connection, but is not: we have one for blob, but not for DFS. This is likely linked to the Azure Databricks service moving from Blob to ADLS.

Will look to PR a fix.
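
The comparison in this comment amounts to a set difference between the endpoints in the JSON file and those in the published list. A rough sketch, using only the West Europe hostnames named above; the `diff_endpoints` helper and variable names are hypothetical, and the sets are deliberately limited to the hosts mentioned in the thread rather than the full lists.

```python
def diff_endpoints(configured, published):
    """Return (missing, stale): hosts to add to, and remove from, the config."""
    configured, published = set(configured), set(published)
    missing = sorted(published - configured)  # published but not configured
    stale = sorted(configured - published)    # configured but no longer published
    return missing, stale


# Only the hosts named in this thread, not the complete endpoint lists.
configured = {"dbartifactsprodwesteu1.blob.core.windows.net"}
published = {
    "consolidated-westeurope-prod-metastore.mysql.database.azure.com",
    "arprodwesteua1.blob.core.windows.net",
}

missing, stale = diff_endpoints(configured, published)
print("add:", missing)
print("remove:", stale)
```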

@marrobi
Copy link
Member

marrobi commented Jun 21, 2023

[image]

Nice breaking change from the Azure Databricks service.

@marrobi
Copy link
Member

marrobi commented Jun 21, 2023

@tljjvogten If you can test the changes in the PR, that would be appreciated. It works for me.

The easiest way is to copy the Databricks workspace service from the PR branch into a /templates/workspace_services/databricks directory and run make workspace_service_bundle BUNDLE=databricks.

Thanks for reporting this.

@adbrom

adbrom commented Jun 22, 2023

@marrobi we applied the fix and managed to get a Databricks cluster up and running.

Thanks for your quick response!

marrobi added the bug label and removed the question label Jun 27, 2023
marrobi moved this to Up Next in Azure TRE - Engineering Jun 27, 2023
marrobi moved this from Up Next to In Progress in Azure TRE - Engineering Jun 27, 2023
marrobi moved this from In Progress to PR in Azure TRE - Engineering Jun 27, 2023
tamirkamara assigned sachinkundu and unassigned marrobi Jul 13, 2023