This integration collects telemetry from Databricks (including Spark on Databricks) and/or Spark telemetry from any Spark deployment. See the Features section for supported telemetry types.
- All references within this document to Databricks documentation reference the Databricks on AWS documentation. Use the cloud switcher menu located in the upper right-hand corner of the documentation to select corresponding documentation for a different cloud.
- On-host deployment is currently the only supported deployment type. For Databricks and non-Databricks Spark deployments, the integration can be deployed on any supported host platform. For Databricks, support is also provided to deploy the integration on the driver node of a Databricks cluster using a cluster-scoped init script.
To get started with the New Relic Databricks integration, deploy the integration using a supported deployment type, configure the integration using supported configuration mechanisms, and then import the sample dashboard.
The New Relic Databricks integration can be run on any supported host platform. The integration will collect Databricks telemetry (including Spark on Databricks) via the Databricks ReST API using the Databricks SDK for Go and/or Spark telemetry from a non-Databricks Spark deployment via the Spark ReST API.
The New Relic Databricks integration can also be deployed on the driver node of a Databricks cluster using the provided init script to install and configure the integration at cluster startup time.
The New Relic Databricks integration provides binaries for the following host platforms.
- Linux amd64
- Windows amd64
To run the Databricks integration on a host, perform the following steps.
- Download the appropriate archive for your platform from the latest release.
- Extract the archive to a new or existing directory.
- Create a directory named `configs` in the same directory.
- Create a file named `config.yml` in the `configs` directory and copy the contents of the file `configs/config.template.yml` in this repository into it.
- Edit the `config.yml` file to configure the integration appropriately for your environment.
- From the directory where the archive was extracted, execute the integration binary using the command `./newrelic-databricks-integration` (or `.\newrelic-databricks-integration.exe` on Windows) with the appropriate Command Line Options.
The New Relic Databricks integration can be deployed on the driver node of a Databricks cluster using a cluster-scoped init script. The init script uses custom environment variables to specify configuration parameters necessary for the integration configuration.
To install the init script, perform the following steps.
- Log in to your Databricks account and navigate to the desired workspace.
- Follow the recommendations for init scripts to store the `cluster_init_integration.sh` script within your workspace in the recommended manner. For example, if your workspace is enabled for Unity Catalog, you should store the init script in a Unity Catalog volume.
- Navigate to the Compute tab and select the desired all-purpose or job compute to open the compute details UI.
- Click the button labeled Edit to edit the compute's configuration.
- Follow the steps to use the UI to configure a cluster-scoped init script and point to the location where you stored the init script in the second step above.
- If your cluster is not running, click on the button labeled Confirm to save your changes. Then, restart the cluster. If your cluster is already running, click on the button labeled Confirm and restart to save your changes and restart the cluster.
Additionally, follow the steps to set environment variables to add the following environment variables.
- `NEW_RELIC_API_KEY` - Your New Relic User API Key
- `NEW_RELIC_LICENSE_KEY` - Your New Relic License Key
- `NEW_RELIC_ACCOUNT_ID` - Your New Relic Account ID
- `NEW_RELIC_REGION` - The region of your New Relic account; one of `US` or `EU`
- `NEW_RELIC_DATABRICKS_WORKSPACE_HOST` - The instance name of the target Databricks instance
- `NEW_RELIC_DATABRICKS_ACCESS_TOKEN` - To authenticate with a personal access token, your personal access token
- `NEW_RELIC_DATABRICKS_OAUTH_CLIENT_ID` - To use a service principal to authenticate with Databricks (OAuth M2M), the OAuth client ID for the service principal
- `NEW_RELIC_DATABRICKS_OAUTH_CLIENT_SECRET` - To use a service principal to authenticate with Databricks (OAuth M2M), an OAuth client secret associated with the service principal
Note that the `NEW_RELIC_API_KEY` and `NEW_RELIC_ACCOUNT_ID` are currently unused but are required by the `new-relic-client-go` module used by the integration. Additionally, note that only the personal access token or OAuth credentials need to be specified but not both. If both are specified, the OAuth credentials take precedence. Finally, make sure to restart the cluster following the configuration of the environment variables.
The New Relic Databricks integration supports the following capabilities.
- Collect Spark telemetry

  The New Relic Databricks integration can collect telemetry from Spark running on Databricks. By default, the integration will automatically connect to and collect telemetry from the Spark deployments in all clusters created via the UI or API in the specified workspace.

  The New Relic Databricks integration can also collect Spark telemetry from any non-Databricks Spark deployment.

- Collect Databricks consumption and cost data

  The New Relic Databricks integration can collect consumption and cost related data from the Databricks system tables. This data can be used to show Databricks DBU consumption metrics and estimated Databricks costs directly within New Relic.
Option | Description | Default |
---|---|---|
--config_path | path to the `config.yml` to use | configs/config.yml |
--dry_run | flag to enable "dry run" mode | false |
--env_prefix | prefix to use for environment variable lookup | '' |
--verbose | flag to enable "verbose" mode | false |
--version | display version information only | N/a |
The Databricks integration is configured using the `config.yml` and/or environment variables. For Databricks, authentication related configuration parameters may also be set in a Databricks configuration profile. In all cases, where applicable, environment variables always take precedence.
All configuration parameters for the Databricks integration can be set using a YAML file named `config.yml`. The default location for this file is `configs/config.yml` relative to the current working directory when the integration binary is executed. The supported configuration parameters are listed below. See `config.template.yml` for a full configuration example.
The parameters in this section are configured at the top level of the `config.yml`.
Description | Valid Values | Required | Default |
---|---|---|---|
New Relic license key | string | Y | N/a |
This parameter specifies the New Relic License Key (INGEST) that should be used to send generated metrics.
The license key can also be specified using the `NEW_RELIC_LICENSE_KEY` environment variable.
Description | Valid Values | Required | Default |
---|---|---|---|
New Relic region identifier | US / EU | N | US |
This parameter specifies the New Relic region to which generated metrics should be sent.
Description | Valid Values | Required | Default |
---|---|---|---|
Polling interval (in seconds) | numeric | N | 60 |
This parameter specifies the interval (in seconds) at which the integration should poll for data.
This parameter is only used when `runAsService` is set to `true`.
Description | Valid Values | Required | Default |
---|---|---|---|
Flag to enable running the integration as a "service" | true / false | N | false |
The integration can run either as a "service" or as a simple command line utility which runs once and exits when it is complete.
When set to `true`, the integration process will run continuously and poll for data at the recurring interval specified by the `interval` parameter. The process will only exit if it is explicitly stopped or a fatal error or panic occurs.
When set to `false`, the integration will run once and exit. This is intended for use with an external scheduling mechanism like cron.
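As an illustration, a minimal sketch of the top-level settings for a continuously running deployment might look like the following. The `interval` and `runAsService` names come from the descriptions above; the `licenseKey` and `region` key names and the placeholder license key value are assumptions made for this example, so check `config.template.yml` for the exact names.

```yaml
# Sketch only: key names marked "assumed" should be verified against config.template.yml.
licenseKey: YOUR_NEW_RELIC_LICENSE_KEY  # assumed key name; placeholder value
region: US                              # assumed key name; US or EU
interval: 60                            # polling interval in seconds; only used when runAsService is true
runAsService: true                      # run continuously instead of running once and exiting
```

With `runAsService` set to `false`, the `interval` value would be ignored and an external scheduler such as cron would decide how often the integration runs.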
Description | Valid Values | Required | Default |
---|---|---|---|
The root node for the set of pipeline configuration parameters | YAML Mapping | N | N/a |
The integration retrieves, processes, and exports data to New Relic using a data pipeline consisting of one or more receivers, a processing chain, and a New Relic exporter. Various aspects of the pipeline are configurable. This element groups together the configuration parameters related to pipeline configuration.
Description | Valid Values | Required | Default |
---|---|---|---|
The root node for the set of log configuration parameters | YAML Mapping | N | N/a |
The integration uses the logrus package for application logging. This element groups together the configuration parameters related to log configuration.
Description | Valid Values | Required | Default |
---|---|---|---|
The integration execution mode | databricks | N | databricks |
The integration execution mode. Currently, the only supported execution mode is `databricks`.
Deprecated: As of v2.3.0, this configuration parameter is no longer used. The presence (or not) of the `databricks` top-level node will be used to enable (or disable) the Databricks collector. Likewise, the presence (or not) of the `spark` top-level node will be used to enable (or disable) the Spark collector separate from Databricks.
Description | Valid Values | Required | Default |
---|---|---|---|
The root node for the set of Databricks configuration parameters | YAML Mapping | N | N/a |
This element groups together the configuration parameters to configure the Databricks collector. If this element is not specified, the Databricks collector will not be run.
Note that this node is not required. It can be used with or without the `spark` top-level node.
Description | Valid Values | Required | Default |
---|---|---|---|
The root node for the set of Spark configuration parameters | YAML Mapping | N | N/a |
This element groups together the configuration parameters to configure the Spark collector. If this element is not specified, the Spark collector will not be run.
Note that this node is not required. It can be used with or without the `databricks` top-level node.
Description | Valid Values | Required | Default |
---|---|---|---|
The root node for a set of custom tags to add to all telemetry sent to New Relic | YAML Mapping | N | N/a |
This element specifies a group of custom tags that will be added to all telemetry sent to New Relic. The tags are specified as a set of key-value pairs.
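For illustration, a set of custom tags could be written as a YAML mapping like the one below. The `tags` key name is an assumption (the element name is not shown above), and the tag names and values are hypothetical.

```yaml
# Sketch only: "tags" is an assumed key name; the entries are hypothetical examples.
tags:
  environment: production
  team: data-platform
```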
Description | Valid Values | Required | Default |
---|---|---|---|
Size of the buffer that holds items before processing | number | N | 500 |
This parameter specifies the size of the buffer that holds received items before being flushed through the processing chain and on to the exporters. When this size is reached, the items in the buffer will be flushed automatically.
Description | Valid Values | Required | Default |
---|---|---|---|
Harvest interval (in seconds) | number | N | 60 |
This parameter specifies the interval (in seconds) at which the pipeline should automatically flush received items through the processing chain and on to the exporters. Each time this interval is reached, the pipeline will flush items even if the item buffer has not reached the size specified by the `receiveBufferSize` parameter.
Description | Valid Values | Required | Default |
---|---|---|---|
Number of concurrent pipeline instances to run | number | N | 3 |
The integration retrieves, processes, and exports metrics to New Relic using a data pipeline consisting of one or more receivers, a processing chain, and a New Relic exporter. When `runAsService` is `true`, the integration can launch one or more "instances" of this pipeline to receive, process, and export data concurrently. Each "instance" will be configured with the same processing chain and exporters and the receivers will be spread across the available instances in a round-robin fashion.
This parameter specifies the number of pipeline instances to launch.
NOTE: When `runAsService` is `false`, only a single pipeline instance is used.
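The three pipeline settings described above might be combined as in the following sketch, which simply restates the documented defaults. Only `receiveBufferSize` is named in the text; the `pipeline`, `harvestInterval`, and `instances` key names are assumptions for illustration.

```yaml
# Sketch only: verify assumed key names against config.template.yml.
pipeline:                 # assumed name of the pipeline root node
  receiveBufferSize: 500  # flush automatically once this many items are buffered
  harvestInterval: 60     # assumed key name; flush at least every 60 seconds
  instances: 3            # assumed key name; only applies when runAsService is true
```

Under these values, items are flushed when either threshold is hit first: 500 buffered items or 60 seconds elapsed.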
Description | Valid Values | Required | Default |
---|---|---|---|
Log level | panic / fatal / error / warn / info / debug / trace | N | warn |
This parameter specifies the minimum severity of log messages to output with `trace` being the least severe and `panic` being the most severe. For example, at the default log level (`warn`), all log messages with severities `warn`, `error`, `fatal`, and `panic` will be output but `info`, `debug`, and `trace` will not.
Description | Valid Values | Required | Default |
---|---|---|---|
Path to a file where log output will be written | string | N | stderr |
This parameter designates a file path where log output should be written. When no path is specified, log output will be written to the standard error stream (`stderr`).
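A log configuration that raises verbosity and writes to a file might look like the sketch below. The `log`, `level`, and `fileName` key names are assumptions, and the file path is hypothetical.

```yaml
# Sketch only: key names are assumed; verify against config.template.yml.
log:
  level: debug  # outputs debug, info, warn, error, fatal, and panic messages
  fileName: /var/log/newrelic-databricks-integration.log  # hypothetical path; omit to log to stderr
```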
The Databricks configuration parameters are used to configure the Databricks collector.
Description | Valid Values | Required | Default |
---|---|---|---|
Databricks workspace instance name | string | conditional | N/a |
This parameter specifies the instance name of the target Databricks instance for which data should be collected. This is used by the integration when constructing the URLs for API calls. Note that the value of this parameter must not include the `https://` prefix, e.g. use `my-databricks-instance-name.cloud.databricks.com` rather than `https://my-databricks-instance-name.cloud.databricks.com`.
This parameter is required when the collection of Spark telemetry for Spark running on Databricks is enabled. Note that this does not apply when the integration is deployed directly on the driver node via the provided init script. This parameter is unused in that scenario.
The workspace host can also be specified using the `DATABRICKS_HOST` environment variable.
NOTE: The `DATABRICKS_HOST` environment variable cannot be used to specify both the instance name and the accounts API endpoint. To account for this, the `DATABRICKS_WORKSPACEHOST` and `DATABRICKS_ACCOUNTHOST` environment variables can be used instead, either separately or in combination with the `DATABRICKS_HOST` environment variable, to specify the instance name and the accounts API endpoint, respectively.
Description | Valid Values | Required | Default |
---|---|---|---|
Databricks accounts API endpoint | string | conditional | N/a |
This parameter specifies the accounts API endpoint. This is used by the integration when constructing the URLs for account-level ReST API calls. Note that unlike the value of `workspaceHost`, the value of this parameter must include the `https://` prefix, e.g. `https://accounts.cloud.databricks.com`.
This parameter is required when the collection of Databricks consumption and cost data is enabled.
The account host can also be specified using the `DATABRICKS_HOST` environment variable.
NOTE: The `DATABRICKS_HOST` environment variable cannot be used to specify both the instance name and the accounts API endpoint. To account for this, the `DATABRICKS_WORKSPACEHOST` and `DATABRICKS_ACCOUNTHOST` environment variables can be used instead, either separately or in combination with the `DATABRICKS_HOST` environment variable, to specify the instance name and the accounts API endpoint, respectively.
Description | Valid Values | Required | Default |
---|---|---|---|
Databricks account ID for the accounts API | string | conditional | N/a |
This parameter specifies the Databricks account ID. This is used by the integration when constructing the URLs for account-level ReST API calls.
This parameter is required when the collection of Databricks consumption and cost data is enabled.
Description | Valid Values | Required | Default |
---|---|---|---|
Databricks personal access token | string | N | N/a |
When set, the integration will use Databricks personal access token authentication to authenticate Databricks API calls with the value of this parameter as the Databricks personal access token.
The personal access token can also be specified using the `DATABRICKS_TOKEN` environment variable or any other SDK-supported mechanism (e.g. the `token` field in a Databricks configuration profile).
See the authentication section for more details.
NOTE: Databricks personal access tokens can only be used to collect data at the workspace level. They cannot be used to collect account-level data using the account-level ReST APIs. To collect account-level data such as consumption and cost data, OAuth authentication must be used instead.
Description | Valid Values | Required | Default |
---|---|---|---|
Databricks OAuth M2M client ID | string | N | N/a |
When set, the integration will use a service principal to authenticate with Databricks (OAuth M2M) when making Databricks API calls. The value of this parameter will be used as the OAuth client ID.
The OAuth client ID can also be specified using the `DATABRICKS_CLIENT_ID` environment variable or any other SDK-supported mechanism (e.g. the `client_id` field in a Databricks configuration profile).
See the authentication section for more details.
Description | Valid Values | Required | Default |
---|---|---|---|
Databricks OAuth M2M client secret | string | N | N/a |
When the `oauthClientId` is set, this parameter can be set to specify the OAuth secret associated with the service principal.
The OAuth client secret can also be specified using the `DATABRICKS_CLIENT_SECRET` environment variable or any other SDK-supported mechanism (e.g. the `client_secret` field in a Databricks configuration profile).
See the authentication section for more details.
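Putting the host and OAuth parameters together, a Databricks section that authenticates with a service principal might look like the following sketch. The `workspaceHost` and `oauthClientId` names appear in the text above; `accountHost`, `accountId`, and `oauthClientSecret` are assumed key names, and all values are placeholders.

```yaml
# Sketch only: verify assumed key names against config.template.yml.
databricks:
  workspaceHost: my-databricks-instance-name.cloud.databricks.com  # no https:// prefix
  accountHost: https://accounts.cloud.databricks.com               # assumed key name; https:// prefix required
  accountId: YOUR_DATABRICKS_ACCOUNT_ID                            # assumed key name; placeholder value
  oauthClientId: YOUR_OAUTH_CLIENT_ID                              # placeholder value
  oauthClientSecret: YOUR_OAUTH_CLIENT_SECRET                      # assumed key name; placeholder value
```

To keep the secret out of the file, the client ID and secret could instead be supplied through the `DATABRICKS_CLIENT_ID` and `DATABRICKS_CLIENT_SECRET` environment variables mentioned above.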
Description | Valid Values | Required | Default |
---|---|---|---|
Flag to enable automatic collection of Spark metrics | true / false | N | true |
Deprecated: This configuration parameter has been deprecated in favor of the configuration parameter `databricks.spark.enabled`. Use that parameter instead.
Description | Valid Values | Required | Default |
---|---|---|---|
A prefix to prepend to Spark metric names | string | N | N/a |
Deprecated: This configuration parameter has been deprecated in favor of the configuration parameter `databricks.spark.metricPrefix`. Use that parameter instead.
Description | Valid Values | Required | Default |
---|---|---|---|
The root node for the Databricks cluster source configuration | YAML Mapping | N | N/a |
Deprecated: This configuration parameter has been deprecated in favor of the configuration parameter `databricks.spark.clusterSources`. Use that parameter instead.
Description | Valid Values | Required | Default |
---|---|---|---|
Timeout (in seconds) to use when executing SQL statements on a SQL warehouse | number | N | 30 |
Certain telemetry and data collected by the Databricks collector requires the collector to run Databricks SQL statements on a SQL warehouse. This configuration parameter specifies the number of seconds to wait before timing out a pending or running SQL query.
Description | Valid Values | Required | Default |
---|---|---|---|
The root node for the set of Databricks Spark configuration parameters | YAML Mapping | N | N/a |
This element groups together the configuration parameters to configure the Databricks collector settings related to the collection of telemetry from Spark running on Databricks. The configuration parameters in this group replace the configuration parameters `sparkMetrics`, `sparkMetricPrefix`, and `sparkClusterSources`.
Description | Valid Values | Required | Default |
---|---|---|---|
The root node for the set of Databricks Usage configuration parameters | YAML Mapping | N | N/a |
This element groups together the configuration parameters to configure the Databricks collector settings related to the collection of consumption and cost data.
Description | Valid Values | Required | Default |
---|---|---|---|
The root node for the set of Databricks Job configuration parameters | YAML Mapping | N | N/a |
This element groups together the configuration parameters to configure the Databricks collector settings related to the collection of job data.
Description | Valid Values | Required | Default |
---|---|---|---|
Flag to enable automatic collection of Spark metrics | true / false | N | true |
By default, when the Databricks collector is enabled, it will automatically collect Spark telemetry from Spark running on Databricks.
This flag can be used to disable the collection of Spark telemetry by the Databricks collector. This may be useful to control data ingest when business requirements call for the collection of non-Spark related Databricks telemetry and Spark telemetry is not used. This flag is also used by the integration when it is deployed directly on the driver node of a Databricks cluster using the provided init script since Spark telemetry is collected by the Spark collector in this scenario.
NOTE: This configuration parameter replaces the older `sparkMetrics` configuration parameter.
Description | Valid Values | Required | Default |
---|---|---|---|
A prefix to prepend to Spark metric names | string | N | N/a |
This parameter serves the same purpose as the `metricPrefix` parameter of the Spark configuration except that it applies to Spark telemetry collected by the Databricks collector. See the `metricPrefix` parameter of the Spark configuration for more details.
Note that this parameter has no effect on Spark telemetry collected by the Spark collector. This includes the case when the integration is deployed directly on the driver node of a Databricks cluster using the provided init script since Spark telemetry is collected by the Spark collector in this scenario.
NOTE: This configuration parameter replaces the older `sparkMetricPrefix` configuration parameter.
Description | Valid Values | Required | Default |
---|---|---|---|
The root node for the Databricks cluster source configuration | YAML Mapping | N | N/a |
The mechanism used to create a cluster is referred to as a cluster "source". The Databricks collector supports collecting Spark telemetry from all-purpose clusters created via the UI or API and from job clusters created via the Databricks Jobs Scheduler. This element groups together the flags used to individually enable or disable the cluster sources from which the Databricks collector will collect Spark telemetry.
NOTE: This configuration parameter replaces the older `sparkClusterSources` configuration parameter.
Description | Valid Values | Required | Default |
---|---|---|---|
Flag to enable automatic collection of Spark telemetry from all-purpose clusters created via the UI | true / false | N | true |
By default, when the Databricks collector is enabled, it will automatically collect Spark telemetry from all all-purpose clusters created via the UI.
This flag can be used to disable the collection of Spark telemetry from all-purpose clusters created via the UI.
Description | Valid Values | Required | Default |
---|---|---|---|
Flag to enable automatic collection of Spark telemetry from job clusters created via the Databricks Jobs Scheduler | true / false | N | true |
By default, when the Databricks collector is enabled, it will automatically collect Spark telemetry from job clusters created by the Databricks Jobs Scheduler.
This flag can be used to disable the collection of Spark telemetry from job clusters created via the Databricks Jobs Scheduler.
Description | Valid Values | Required | Default |
---|---|---|---|
Flag to enable automatic collection of Spark telemetry from all-purpose clusters created via the Databricks ReST API | true / false | N | true |
By default, when the Databricks collector is enabled, it will automatically collect Spark telemetry from all-purpose clusters created via the Databricks ReST API.
This flag can be used to disable the collection of Spark telemetry from all-purpose clusters created via the Databricks ReST API.
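As a sketch of how the Spark-related settings of the Databricks collector fit together, the following configuration keeps Spark collection enabled but skips job clusters. The `databricks.spark.enabled`, `databricks.spark.metricPrefix`, and `databricks.spark.clusterSources` paths are named above; the `ui`, `job`, and `api` flag names are assumptions for illustration.

```yaml
# Sketch only: the three cluster source flag names are assumed; verify against config.template.yml.
databricks:
  spark:
    enabled: true
    metricPrefix: "spark."
    clusterSources:
      ui: true    # assumed flag name; all-purpose clusters created via the UI
      job: false  # assumed flag name; job clusters created via the Databricks Jobs Scheduler
      api: true   # assumed flag name; all-purpose clusters created via the Databricks ReST API
```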
The Databricks usage configuration parameters are used to configure Databricks collector settings related to the collection of Databricks consumption and cost data.
Description | Valid Values | Required | Default |
---|---|---|---|
Flag to enable automatic collection of consumption and cost data | true / false | N | true |
By default, when the Databricks collector is enabled, it will automatically collect consumption and cost data.
This flag can be used to disable the collection of consumption and cost data by the Databricks collector. This may be useful when running multiple instances of the New Relic Databricks integration. In this scenario, Databricks consumption and cost data collection should only be enabled on a single instance. Otherwise, this data will be recorded more than once in New Relic, affecting consumption and cost calculations.
Description | Valid Values | Required | Default |
---|---|---|---|
ID of a SQL warehouse on which to run usage-related SQL statements | string | Y | N/a |
The ID of a SQL warehouse on which to run the SQL statements used to collect Databricks consumption and cost data.
This parameter is required when the collection of Databricks consumption and cost data is enabled.
Description | Valid Values | Required | Default |
---|---|---|---|
Flag to enable inclusion of identity related metadata in consumption and cost data | true / false | N | false |
When the collection of Databricks consumption and cost data is enabled, the Databricks collector can include several pieces of identifying information along with the consumption and cost data.
By default, when the collection of Databricks consumption and cost data is enabled, the Databricks collector will not collect such data as it may be personally identifiable. This flag can be used to enable the inclusion of the identifying information.
When enabled, the following values are included.
- The identity of the user a serverless billing record is attributed to. This value is included in the identity metadata returned from usage records in the billable usage system table.
- The identity of the cluster creator for each usage record for billable usage attributed to all-purpose and job compute.
- The single user name for each usage record for billable usage attributed to all-purpose and job compute configured for single-user access mode.
- The identity of the warehouse creator for each usage record for billable usage attributed to SQL warehouse compute.
- The identity of the user or service principal used to run jobs for each query result collected by job cost data queries.
Description | Valid Values | Required | Default |
---|---|---|---|
Time of day (as HH:mm:ss) at which to run usage data collection | string with format HH:mm:ss | N | 02:00:00 |
This parameter specifies the time of day at which the collection of consumption and cost data occurs. The value must be of the form `HH:mm:ss` where `HH` is the `0`-padded 24-hour clock hour (`00` - `23`), `mm` is the `0`-padded minute (`00` - `59`) and `ss` is the `0`-padded second (`00` - `59`). For example, `09:00:00` is the time 9:00 AM and `23:30:00` is the time 11:30 PM.
The time will always be interpreted according to the UTC time zone. The time zone cannot be configured. For example, to specify that the integration should be run at 2:00 AM EST (-0500), the value `07:00:00` should be specified.
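The usage settings described so far might be combined as in the sketch below, which schedules collection for 2:00 AM EST by giving the equivalent UTC time. The `usage`, `enabled`, `warehouseId`, `includeIdentityMetadata`, and `runTime` key names are assumptions, and the warehouse ID is a placeholder.

```yaml
# Sketch only: key names under "databricks" are assumed; verify against config.template.yml.
databricks:
  usage:
    enabled: true
    warehouseId: YOUR_SQL_WAREHOUSE_ID  # placeholder; required when usage collection is enabled
    includeIdentityMetadata: false      # leave identifying information out of usage data
    runTime: "07:00:00"                 # always UTC; equals 2:00 AM EST (-0500)
```

The time is quoted here as a cautious choice, since some YAML parsers may interpret an unquoted `07:00:00` as a number rather than a string.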
Description | Valid Values | Required | Default |
---|---|---|---|
The root node for the set of flags used to selectively enable or disable optional usage queries | YAML Mapping | N | N/a |
When the collection of Databricks consumption and cost data
is enabled, the Databricks collector will always
collect billable usage data and list pricing data
on every run. In addition, by default, the Databricks collector will also run
all job cost queries on every run. However, the latter behavior
can be configured using a set of flags specified with this configuration
property to selectively enable or disable the job cost queries.
Each flag is specified using a property with the query ID as the name of the property and `true` or `false` as the value of the property. The following flags are supported.
- `jobs_cost_list_cost_per_job_run`
- `jobs_cost_list_cost_per_job`
- `jobs_cost_frequent_failures`
- `jobs_cost_most_retries`

For example, to enable the list cost per job run query and the list cost per job query but disable the list cost of failed job runs for jobs with frequent failures query and the list cost of repaired job runs for jobs with frequent repairs query, the following configuration would be specified.

    optionalQueries:
      jobs_cost_list_cost_per_job_run: true
      jobs_cost_list_cost_per_job: true
      jobs_cost_frequent_failures: false
      jobs_cost_most_retries: false

Description | Valid Values | Required | Default |
---|---|---|---|
The root node for the set of Databricks Job Run configuration parameters | YAML Mapping | N | N/a |
This element groups together the configuration parameters to configure the Databricks collector settings related to the collection of job run data.
The Databricks job run configuration parameters are used to configure Databricks collector settings related to the collection of Databricks job run data.
Description | Valid Values | Required | Default |
---|---|---|---|
Flag to enable automatic collection of job run data | true / false | N | true |
By default, when the Databricks collector is enabled, it will automatically collect job run data.
This flag can be used to disable the collection of job run data by the Databricks collector. This may be useful when running multiple instances of the New Relic Databricks integration against the same Databricks workspace. In this scenario, Databricks job run data collection should only be enabled on a single instance of the integration. Otherwise, this data will be recorded more than once in New Relic, affecting product features that use job run metrics (e.g. dashboards and alerts).
Description | Valid Values | Required | Default |
---|---|---|---|
A prefix to prepend to Databricks job run metric names | string | N | N/a |
This parameter specifies a prefix that will be prepended to each Databricks job run metric name when the metric is exported to New Relic.
For example, if this parameter is set to `databricks.`, then the full name of the metric representing the duration of a job run (`job.run.duration`) will be `databricks.job.run.duration`.
Note that it is not recommended to leave this value empty as the metric names without a prefix may be ambiguous.
Description | Valid Values | Required | Default |
---|---|---|---|
Flag to enable inclusion of the job run ID in the `databricksJobRunId` attribute on all job run metrics | true / false | N | false |
By default, the Databricks collector will not include job run IDs on any of the job run metrics. Job run IDs are unique across all jobs and job runs, so they have high cardinality and including them could violate metric cardinality limits.
This flag can be used to enable the inclusion of the job run ID in the `databricksJobRunId` attribute on all job run metrics.
When enabled, use the Limits UI and/or create a dashboard in order to monitor your limit status. Additionally, set alerts on resource metrics to provide updates on limits changes.
Description | Valid Values | Required | Default |
---|---|---|---|
Offset (in seconds) from the current time to use for calculating the earliest job run start time to match when listing job runs | number | N | 86400 (1 day) |
This parameter specifies an offset, in seconds, that can be used to tune the collector's performance. It limits the number of job runs returned by requiring the start time of matching job runs to be greater than a time in the past, calculated as an offset from the current time.
See the section `startOffset` Configuration for more details.
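A job run configuration using the documented defaults and the `databricks.` prefix from the example above might look like the following sketch. Only `startOffset` is named in the text; the `jobs`, `runs`, `enabled`, `metricPrefix`, and `includeRunId` key names and their nesting are assumptions for illustration.

```yaml
# Sketch only: nesting and key names other than startOffset are assumed; verify against config.template.yml.
databricks:
  jobs:
    runs:
      enabled: true
      metricPrefix: "databricks."  # job.run.duration becomes databricks.job.run.duration
      includeRunId: false          # keep metric cardinality low
      startOffset: 86400           # only match job runs started within the last day
```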
The Spark configuration parameters are used to configure the Spark collector.
Description | Valid Values | Required | Default |
---|---|---|---|
The Web UI URL of an application on the Spark deployment to monitor | string | N | N/a |
This parameter can be used to monitor a non-Databricks Spark deployment. It specifies the URL of the Web UI of an application running on the Spark deployment to monitor. The value should be of the form `http[s]://<hostname>:<port>` where `<hostname>` is the hostname of the Spark deployment to monitor and `<port>` is the port number of the Spark application's Web UI (typically 4040, or 4041, 4042, etc. if more than one application is running on the same host).
Note that the value must not contain a path. The path of the Spark ReST API endpoints (mounted at `/api/v1`) will automatically be prepended.
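As an illustration of the URL rule above, the following Go sketch checks that a Web UI URL has no path and derives the API base from it. The function name `sparkAPIBase` and its error messages are assumptions made for this example; it is not the integration's actual validation logic.

```go
package main

import (
	"fmt"
	"net/url"
)

// sparkAPIBase is a hypothetical helper. It accepts a Web UI URL of the form
// http[s]://<hostname>:<port>, rejects values that contain a path, and
// returns the base URL of the Spark ReST API mounted at /api/v1.
func sparkAPIBase(webUIURL string) (string, error) {
	u, err := url.Parse(webUIURL)
	if err != nil {
		return "", err
	}
	if u.Scheme != "http" && u.Scheme != "https" {
		return "", fmt.Errorf("unsupported scheme %q", u.Scheme)
	}
	if u.Host == "" {
		return "", fmt.Errorf("missing host in %q", webUIURL)
	}
	// A bare trailing slash is tolerated in this sketch; any other path is not.
	if u.Path != "" && u.Path != "/" {
		return "", fmt.Errorf("web UI URL must not contain a path: %q", u.Path)
	}
	return u.Scheme + "://" + u.Host + "/api/v1", nil
}

func main() {
	base, err := sparkAPIBase("http://localhost:4040")
	fmt.Println(base, err)

	_, err = sparkAPIBase("http://localhost:4040/jobs")
	fmt.Println(err)
}
```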
Description | Valid Values | Required | Default |
---|---|---|---|
A prefix to prepend to Spark metric names | string | N | N/a |
This parameter specifies a prefix that will be prepended to each Spark metric name when the metric is exported to New Relic.
For example, if this parameter is set to `spark.`, then the full name of the metric representing the value of the memory used on application executors (`app.executor.memoryUsed`) will be `spark.app.executor.memoryUsed`.
Note that it is not recommended to leave this value empty as the metric names without a prefix may be ambiguous. Additionally, note that this parameter has no effect on Spark telemetry collected by the Databricks collector. In that case, use the `sparkMetricPrefix` parameter instead.
The Databricks integration uses the Databricks SDK for Go to access the Databricks and Spark ReST APIs. The SDK performs authentication on behalf of the integration and provides many options for configuring the authentication type and credentials to be used. See the SDK documentation and the Databricks client unified authentication documentation for details.
For convenience, the following parameters can be used in the Databricks configuration section of the `config.yml` file.
- `accessToken` - When set, the integration will instruct the SDK to explicitly use Databricks personal access token authentication. The SDK will not attempt to try other authentication mechanisms and instead will fail immediately if personal access token authentication fails.

  NOTE: Databricks personal access tokens can only be used to collect data at the workspace level. They can not be used to collect account-level data using the account-level ReST APIs. To collect account level data such as consumption and cost data, OAuth authentication must be used instead.
- `oauthClientId` - When set, the integration will instruct the SDK to explicitly use a service principal to authenticate with Databricks (OAuth M2M). The SDK will not attempt to try other authentication mechanisms and instead will fail immediately if OAuth M2M authentication fails. The OAuth client secret can be set using the `oauthClientSecret` configuration parameter or any of the other mechanisms supported by the SDK (e.g. the `client_secret` field in a Databricks configuration profile or the `DATABRICKS_CLIENT_SECRET` environment variable).
- `oauthClientSecret` - The OAuth client secret to use for OAuth M2M authentication. This value is only used when `oauthClientId` is set in the `config.yml`. The OAuth client secret can also be set using any of the other mechanisms supported by the SDK (e.g. the `client_secret` field in a Databricks configuration profile or the `DATABRICKS_CLIENT_SECRET` environment variable).
The New Relic Databricks integration can collect Databricks consumption and cost related data from the Databricks system tables. This data can be used to show Databricks DBU consumption metrics and estimated Databricks costs directly within New Relic.
This feature is enabled by setting the Databricks usage `enabled` flag to `true` in the integration configuration. When enabled, the Databricks collector will collect consumption and cost related data once a day at the time specified in the `runTime` configuration parameter.
The following information is collected on each run.
- Billable usage records from the `system.billing.usage` table
- List pricing records from the `system.billing.list_prices` table
- List cost per job run of jobs run on jobs compute and serverless compute
- List cost per job of jobs run on jobs compute and serverless compute
- List cost of failed job runs for jobs with frequent failures run on jobs compute and serverless compute
- List cost of repair job runs for jobs with frequent repairs run on jobs compute and serverless compute
NOTE: Job cost data from jobs run on workspaces outside the region of the workspace containing the SQL Warehouse used to collect the consumption data (specified in the `workspaceHost` configuration parameter) will not be returned by the queries used by the Databricks integration to collect Job cost data.
In order for the New Relic Databricks integration to collect consumption and cost related data from Databricks, there are several requirements.
- OAuth authentication must be used. This is required even when the integration is deployed on the driver node of a Databricks cluster using the provided init script because the integration leverages account-level ReST APIs when collecting consumption and cost related data. These APIs can only be accessed when OAuth authentication is used.
- An account ID and account host must be provided since the account-level ReST APIs require them.
- The SQL warehouse ID of a SQL warehouse within the workspace associated with the configured workspace host must be specified. The Databricks SQL queries used to collect consumption and cost related data from the Databricks system tables will be run against the specified SQL warehouse.
- As of v2.6.0, the Databricks integration leverages data that is stored in the `system.lakeflow.jobs` table. As of 10/24/2024, this table is in public preview. To access this table, the `lakeflow` schema must be enabled in your `system` catalog. To enable the `lakeflow` schema, follow the instructions in the Databricks documentation to enable a system schema using the metastore ID of the Unity Catalog metastore attached to the workspace associated with the configured workspace host and the schema name `lakeflow`.
On each run, billable usage data is collected from the `system.billing.usage` table for the previous day. For each billable usage record, a corresponding record is created within New Relic as a New Relic event with the event type `DatabricksUsage` and the following attributes.
NOTE: Not every attribute is included in every event. For example, the `cluster_*` attributes are only included in events for usage records relevant to all-purpose or job related compute. Similarly, the `warehouse_*` attributes are only included in events for usage records relevant to SQL warehouse related compute.
NOTE: Descriptions below are sourced from the billable usage system table reference.
Name | Description |
---|---|
`account_id` | ID of the account this usage record was generated for |
`workspace_id` | ID of the Workspace this usage record was associated with |
`workspace_name` | Name of the Workspace this usage record was associated with |
`record_id` | Unique ID for this usage record |
`sku_name` | Name of the SKU associated with this usage record |
`cloud` | Name of the Cloud this usage record is relevant for |
`usage_start_time` | The start time relevant to this usage record |
`usage_end_time` | The end time relevant to this usage record |
`usage_date` | Date of this usage record |
`usage_unit` | Unit this usage record is measured in |
`usage_quantity` | Number of units consumed for this usage record |
`record_type` | Whether the usage record is original, a retraction, or a restatement. See the section "Analyze Correction Records" in the Databricks documentation for more details. |
`ingestion_date` | Date the usage record was ingested into the usage table |
`billing_origin_product` | The product that originated this usage record |
`usage_type` | The type of usage attributed to the product or workload for billing purposes |
`cluster_id` | ID of the cluster associated with this usage record |
`cluster_creator` | Creator of the cluster associated with this usage record (only included if `includeIdentityMetadata` is `true`) |
`cluster_single_user_name` | Single user name of the cluster associated with this usage record if the access mode of the cluster is single-user access mode (only included if `includeIdentityMetadata` is `true`) |
`cluster_source` | Cluster source of the cluster associated with this usage record |
`cluster_instance_pool_id` | Instance pool ID of the cluster associated with this usage record |
`warehouse_id` | ID of the SQL warehouse associated with this usage record |
`warehouse_name` | Name of the SQL warehouse associated with this usage record |
`warehouse_creator` | Creator of the SQL warehouse associated with this usage record (only included if `includeIdentityMetadata` is `true`) |
`instance_pool_id` | ID of the instance pool associated with this usage record |
`node_type` | The instance type of the compute resource associated with this usage record |
`job_id` | ID of the job associated with this usage record for serverless compute or jobs compute usage |
`job_run_id` | ID of the job run associated with this usage record for serverless compute or jobs compute usage |
`job_name` | Name of the job associated with this usage record for serverless compute or jobs compute usage. NOTE: This field will only contain a value for jobs run within a workspace in the same cloud region as the workspace containing the SQL warehouse used to collect the consumption and cost data (specified in `workspaceHost`). |
`serverless_job_name` | User-given name of the job associated with this usage record for jobs run on serverless compute only |
`notebook_id` | ID of the notebook associated with this usage record for serverless compute for notebook usage |
`notebook_path` | Workspace storage path of the notebook associated with this usage record for serverless compute for notebook usage |
`dlt_pipeline_id` | ID of the Delta Live Tables pipeline associated with this usage record |
`dlt_update_id` | ID of the Delta Live Tables pipeline update associated with this usage record |
`dlt_maintenance_id` | ID of the Delta Live Tables pipeline maintenance tasks associated with this usage record |
`run_name` | Unique user-facing identifier of the Mosaic AI Model Training fine-tuning run associated with this usage record |
`endpoint_name` | Name of the model serving endpoint or vector search endpoint associated with this usage record |
`endpoint_id` | ID of the model serving endpoint or vector search endpoint associated with this usage record |
`central_clean_room_id` | ID of the central clean room associated with this usage record |
`run_as` | See the section "Analyze Identity Metadata" in the Databricks documentation for more details (only included if `includeIdentityMetadata` is `true`) |
`jobs_tier` | Jobs tier product features for this usage record: values include `LIGHT`, `CLASSIC`, or `null` |
`sql_tier` | SQL tier product features for this usage record: values include `CLASSIC`, `PRO`, or `null` |
`dlt_tier` | DLT tier product features for this usage record: values include `CORE`, `PRO`, `ADVANCED`, or `null` |
`is_serverless` | Flag indicating if this usage record is associated with serverless usage: values include `true`, `false`, or `null` |
`is_photon` | Flag indicating if this usage record is associated with Photon usage: values include `true`, `false`, or `null` |
`serving_type` | Serving type associated with this usage record: values include `MODEL`, `GPU_MODEL`, `FOUNDATION_MODEL`, `FEATURE`, or `null` |
In addition, all compute resource tags, jobs tags, and budget policy tags applied to this usage are added as event attributes. For jobs referenced in the `job_id` attribute of `JOBS` usage records, custom job tags are also included if the job was run within a workspace in the same cloud region as the workspace containing the SQL warehouse used to collect the consumption data (specified in the `workspaceHost` configuration parameter).
On each run, list pricing data is collected from the `system.billing.list_prices` table and used to populate a New Relic lookup table named `DatabricksListPrices`. The entire lookup table is updated on each run. For each pricing record, this table will contain a corresponding row with the following columns.
NOTE: Descriptions below are sourced from the pricing system table reference.
Name | Description |
---|---|
`account_id` | ID of the account this pricing record was generated for |
`price_start_time` | The time the price in this pricing record became effective in UTC |
`price_end_time` | The time the price in this pricing record stopped being effective in UTC |
`sku_name` | Name of the SKU associated with this pricing record |
`cloud` | Name of the Cloud this pricing record is relevant for |
`currency_code` | The currency the price in this pricing record is expressed in |
`usage_unit` | The unit of measurement that is monetized in this pricing record |
`list_price` | A single price that can be used for simple long-term estimates |
Note that the `list_price` field contains a single price suitable for simple long-term estimates. It does not reflect any promotional pricing or custom price plans.
On each run, the Databricks collector runs a set of queries that leverage data in the `system.billing.usage` table, the `system.billing.list_prices` table, and the `system.lakeflow.jobs` table to collect job cost related data for the previous day. These queries collect the following data.
- List cost per job run of jobs run on jobs compute and serverless compute
- List cost per job of jobs run on jobs compute and serverless compute
- List cost of failed job runs for jobs with frequent failures run on jobs compute and serverless compute
- List cost of repaired job runs for jobs with frequent repairs run on jobs compute and serverless compute
By default, the Databricks integration runs each query on every run. This behavior can be configured using the `optionalQueries` configuration parameter to selectively enable or disable queries by query ID.
For each query that is run, a New Relic event with the event type `DatabricksJobCost` is created for each result row with one attribute for each column of data. In addition, each event includes a `query_id` attribute that contains the unique ID of the query that produced the data stored in the event. This ID can be used in NRQL queries to scope the query to the appropriate data set.
See the following sub-sections for more details on the data sets collected by each query.
NOTE: Please see the note at the end of the Consumption & Cost Data section regarding job cost data and the last requirement in the Consumption and Cost Requirements section for important considerations for collecting job cost data.
Job cost data on the list cost per job run is collected using the query with the query ID `jobs_cost_list_cost_per_job_run`. This query produces `DatabricksJobCost` events with the following attributes.
Name | Description |
---|---|
`workspace_id` | ID of the Workspace where the job run referenced in the `run_id` attribute occurred |
`workspace_name` | Name of the Workspace where the job run referenced in the `run_id` attribute occurred |
`job_id` | The unique job ID of the job associated with this job run |
`job_name` | The user-given name of the job referenced in the `job_id` attribute |
`run_id` | The unique run ID of this job run |
`run_as` | The ID of the user or service principal used for the job run (only included if `includeIdentityMetadata` is `true`) |
`list_cost` | The estimated list cost of this job run |
`last_seen_date` | The last time a billable usage record was seen referencing the job ID referenced in the `job_id` attribute and the run ID referenced in the `run_id` attribute, in UTC |
Job cost data on the list cost per job is collected using the query with the query ID `jobs_cost_list_cost_per_job`. This query produces `DatabricksJobCost` events with the following attributes.
Name | Description |
---|---|
`workspace_id` | ID of the Workspace containing the job referenced in the `job_id` attribute |
`workspace_name` | Name of the Workspace containing the job referenced in the `job_id` attribute |
`job_id` | The unique job ID of the job |
`job_name` | The user-given name of the job referenced in the `job_id` attribute |
`runs` | The number of job runs seen for the day this result was collected for the job referenced in the `job_id` attribute |
`run_as` | The ID of the user or service principal used to run the job (only included if `includeIdentityMetadata` is `true`) |
`list_cost` | The estimated list cost of all runs for the job referenced in the `job_id` attribute for the day this result was collected |
`last_seen_date` | The last time a billable usage record was seen referencing the job ID referenced in the `job_id` attribute, in UTC |
Job cost data on the list cost of failed job runs for jobs with frequent failures is collected using the query with the query ID `jobs_cost_frequent_failures`. This query produces `DatabricksJobCost` events with the following attributes.
Name | Description |
---|---|
`workspace_id` | ID of the Workspace containing the job referenced in the `job_id` attribute |
`workspace_name` | Name of the Workspace containing the job referenced in the `job_id` attribute |
`job_id` | The unique job ID of the job |
`job_name` | The user-given name of the job referenced in the `job_id` attribute |
`runs` | The number of job runs seen for the day this result was collected for the job referenced in the `job_id` attribute |
`run_as` | The ID of the user or service principal used to run the job (only included if `includeIdentityMetadata` is `true`) |
`failures` | The number of job runs seen with the result state `ERROR`, `FAILED`, or `TIMED_OUT` for the day this result was collected for the job referenced in the `job_id` attribute |
`list_cost` | The estimated list cost of all failed job runs seen for the day this result was collected for the job referenced in the `job_id` attribute |
`last_seen_date` | The last time a billable usage record was seen referencing the job ID referenced in the `job_id` attribute, in UTC |
Job cost data on the list cost of repaired job runs for jobs with frequent repairs is collected using the query with the query ID `jobs_cost_most_retries`. This query produces `DatabricksJobCost` events with the following attributes.
Name | Description |
---|---|
`workspace_id` | ID of the Workspace where the job run referenced in the `run_id` attribute occurred |
`workspace_name` | Name of the Workspace where the job run referenced in the `run_id` attribute occurred |
`job_id` | The unique job ID of the job associated with this job run |
`job_name` | The user-given name of the job referenced in the `job_id` attribute |
`run_id` | The unique run ID of this job run |
`run_as` | The ID of the user or service principal used to run the job (only included if `includeIdentityMetadata` is `true`) |
`repairs` | The number of repair runs seen for the day this result was collected for the job run referenced in the `run_id` attribute where the result of the final repair run was `SUCCEEDED` |
`list_cost` | The estimated list cost of the repair runs seen for the day this result was collected for the job run referenced in the `run_id` attribute |
`repair_time_seconds` | The cumulative duration of the repair runs seen for the day this result was collected for the job run referenced in the `run_id` attribute |
The New Relic Databricks integration can collect telemetry about Databricks Job runs, such as job run durations, task run durations, the current state of job and task runs, if a job or a task is a retry, and the number of times a task was retried. This feature is enabled by default and can be enabled or disabled using the Databricks jobs `enabled` flag in the integration configuration.
NOTE: Some of the text below is sourced from the Databricks SDK Go module documentation.
On each run, the integration uses the Databricks ReST API to retrieve job run data. By default, the Databricks ReST API endpoint for listing job runs returns a paginated list of all historical job runs sorted in descending order by start time. On systems with many jobs, there may be a large number of job runs to retrieve and process, impacting the performance of the collector. To account for this, the integration provides the `startOffset` configuration parameter. This parameter is used to tune the performance of the integration by constraining the start time of job runs to match to be greater than a particular time in the past, calculated as an offset from the current time, thus limiting the number of job runs to return.
The effect of this behavior is that only job runs which have a start time at or after the calculated start time will be returned on the API call. For example, using the default `startOffset` (86400 seconds or 24 hours), only job runs which started within the last 24 hours will be returned. This means that job runs that started more than 24 hours ago will not be returned even if some of those job runs are not yet in a `TERMINATED` state. Therefore, it is important to carefully select a value for the `startOffset` parameter that will account for long-running job runs without degrading the performance of the integration.
Job run data is sent to New Relic as dimensional metrics. The following metrics and attributes (dimensions) are provided.
Metric Name | Metric Type | Description |
---|---|---|
`job.runs` | gauge | Job run counts per state |
`job.tasks` | gauge | Task run counts per state |
`job.run.duration` | gauge | Duration (in milliseconds) of the job run |
`job.run.duration.queue` | gauge | Duration (in milliseconds) the job run spent in the queue |
`job.run.duration.execution` | gauge | Duration (in milliseconds) the job run was actually executing commands (only available for single-task job runs) |
`job.run.duration.setup` | gauge | Duration (in milliseconds) it took to set up the cluster (only available for single-task job runs) |
`job.run.duration.cleanup` | gauge | Duration (in milliseconds) it took to terminate the cluster and clean up associated artifacts (only available for single-task job runs) |
`job.run.task.duration` | gauge | Duration (in milliseconds) of the task run |
`job.run.task.duration.queue` | gauge | Duration (in milliseconds) the task run spent in the queue |
`job.run.task.duration.execution` | gauge | Duration (in milliseconds) the task run was actually executing commands |
`job.run.task.duration.setup` | gauge | Duration (in milliseconds) it took to set up the cluster |
`job.run.task.duration.cleanup` | gauge | Duration (in milliseconds) it took to terminate the cluster and clean up associated artifacts |
Attribute Name | Data Type | Description |
---|---|---|
`databricksJobId` | number | The unique numeric ID of the job being run |
`databricksJobRunId` | number | The unique numeric ID of the job run (only included if `includeRunId` is set to `true`) |
`databricksJobRunName` | string | The optional name of the job run |
`databricksJobRunAttemptNumber` | number | The sequence number of the run attempt for this job run (0 for the original attempt or if the job has no retry policy, greater than 0 for subsequent attempts for jobs with a retry policy) |
`databricksJobRunState` | string | The state of the job run |
`databricksJobRunIsRetry` | boolean | `true` if the job run is a retry of an earlier failed attempt, otherwise `false` |
`databricksJobRunTerminationCode` | string | For terminated jobs, the job run termination code |
`databricksJobRunTerminationType` | string | For terminated jobs, the job run termination type |
`databricksJobRunTaskName` | string | The unique name of the task within its parent job |
`databricksJobRunTaskState` | string | The state of the task run |
`databricksJobRunTaskAttemptNumber` | number | The sequence number of the run attempt for this task run (0 for the original attempt or if the task has no retry policy, greater than 0 for subsequent attempts for tasks with a retry policy) |
`databricksJobRunTaskIsRetry` | boolean | `true` if the task run is a retry of an earlier failed attempt, otherwise `false` |
`databricksJobRunTaskTerminationCode` | string | For terminated tasks, the task run termination code |
`databricksJobRunTaskTerminationType` | string | For terminated tasks, the task run termination type |
Databricks job and task runs can be in one of 6 states.
- `BLOCKED` - run is blocked on an upstream dependency
- `PENDING` - run is waiting to be executed while the cluster and execution context are being prepared
- `QUEUED` - run is queued due to concurrency limits
- `RUNNING` - run is executing
- `TERMINATING` - run has completed, and the cluster and execution context are being cleaned up
- `TERMINATED` - run has completed
The job run and task run states are recorded in the `databricksJobRunState` and `databricksJobRunTaskState` attributes, respectively. The `databricksJobRunState` attribute is on every job run metric, including the task related metrics. The `databricksJobRunTaskState` attribute is only on job run metrics related to tasks, e.g. `job.run.task.duration`.
On each run, the integration records the number of job and task runs in each state (with the exception of runs in the `TERMINATED` state, see below) in the metrics `job.runs` and `job.tasks`, respectively, using the attributes `databricksJobRunState` and `databricksJobRunTaskState` to indicate the state.
In general, only the `latest()` aggregator function should be used when visualizing or alerting on counts including these states. For example, to display the number of job runs by state, use the following NRQL statement.
FROM Metric
SELECT latest(databricks.job.runs)
FACET databricksJobRunState
The counts of job and task runs in the `TERMINATED` state are also recorded but only include job and task runs that have terminated since the last run of the integration. This is done to avoid counting terminated job and task runs more than once, making it straightforward to use aggregator functions to visualize or alert on values such as "number of job runs completed per time period", "average duration of job runs", and "average duration job runs spent in the queue". For example, to show the average time job runs spend in the queue, grouped by job name, use the following NRQL statement.
FROM Metric
SELECT average(databricks.job.run.duration.queue / 1000)
WHERE databricksJobRunState = 'TERMINATED'
FACET databricksJobRunName
LIMIT MAX
NOTE: Make sure to use the condition `databricksJobRunState = 'TERMINATED'` or `databricksJobRunTaskState = 'TERMINATED'` when visualizing or alerting on values using these aggregator functions.
The Databricks integration stores job and task run durations in the metrics `job.run.duration` and `job.run.task.duration`, respectively. The values stored in these metrics are calculated differently, depending on the state of the job or task, as follows.
- While a job or task run is running (not `BLOCKED` or `TERMINATED`), the run duration stored in the respective metrics is the "wall clock" time of the job or task run, calculated as the current time when the metric is collected by the integration minus the `start_time` of the job or task run as returned from the Databricks ReST API. This duration is inclusive of any time spent in the queue and any setup time, execution time, and cleanup time that has been spent up to the time the duration is calculated but does not include any time the job or task run was blocked.

  This duration is also cumulative, meaning that while the job run is running, the value calculated each time the integration runs will include the entire "wall clock" time since the reported `start_time`. It is essentially a `cumulativeCount` metric but stored as a `gauge`. Within NRQL statements, the `latest()` function can be used to get the most recent duration. For example, to show the most recent duration of job runs that are running grouped by job name, use the following NRQL (assuming the job run `metricPrefix` is `databricks.`).

  FROM Metric
  SELECT latest(databricks.job.run.duration)
  WHERE databricksJobRunState != 'TERMINATED'
  FACET databricksJobRunName

- Once a job or task run has been terminated, the run durations stored in the respective metrics are not calculated but instead come directly from the values returned in the `listruns` API (also see the SDK documentation for `BaseRun` and `RunTask`). For both job and task runs, the run durations are determined as follows.
  - For job runs of multi-task jobs, the `job.run.duration` for a job run is set to the `run_duration` field of the corresponding job run item in the `runs` field returned from the `listruns` API.
  - For job runs of single-task jobs, the `job.run.duration` for a job run is set to the sum of the `setup_duration`, the `execution_duration` and the `cleanup_duration` fields of the corresponding job run item in the `runs` field returned from the `listruns` API.
  - For task runs, the `job.run.task.duration` for a task run is set to the sum of the `setup_duration`, the `execution_duration` and the `cleanup_duration` fields of the corresponding task run item in the `tasks` field of the corresponding job run item in the `runs` field returned from the `listruns` API.
In addition to the `job.run.duration` and `job.run.task.duration` metrics, once a job or task has terminated, the `job.run.duration.queue` and `job.run.task.duration.queue` metrics are set to the `queue_duration` field of the corresponding job run item in the `runs` field returned from the `listruns` API for job runs, and the `queue_duration` field of the corresponding task run item in the `tasks` field of the corresponding job run item in the `runs` field from the `listruns` API for task runs.
Finally, for job runs and task runs of single-task jobs, the following metrics are set.
- The `job.run.duration.setup` and `job.run.task.duration.setup` metrics are set to the `setup_duration` field of the corresponding job run item in the `runs` field returned from the `listruns` API for job runs, and the `setup_duration` field of the corresponding task run item in the `tasks` field of the corresponding job run item in the `runs` field from the `listruns` API for task runs.
- The `job.run.duration.execution` and `job.run.task.duration.execution` metrics are set to the `execution_duration` field of the corresponding job run item in the `runs` field returned from the `listruns` API for job runs, and the `execution_duration` field of the corresponding task run item in the `tasks` field of the corresponding job run item in the `runs` field from the `listruns` API for task runs.
- The `job.run.duration.cleanup` and `job.run.task.duration.cleanup` metrics are set to the `cleanup_duration` field of the corresponding job run item in the `runs` field returned from the `listruns` API for job runs, and the `cleanup_duration` field of the corresponding task run item in the `tasks` field of the corresponding job run item in the `runs` field from the `listruns` API for task runs.
All examples below assume the job run `metricPrefix` is `databricks.`.
Current job run counts by state
FROM Metric
SELECT latest(databricks.job.runs)
FACET databricksJobRunState
Current task run counts by state
FROM Metric
SELECT latest(databricks.job.tasks)
FACET databricksJobRunTaskState
Job run durations by job name over time
FROM Metric
SELECT latest(databricks.job.run.duration)
FACET databricksJobRunName
TIMESERIES
Average job run duration by job name
FROM Metric
SELECT average(databricks.job.run.duration)
WHERE databricksJobRunState = 'TERMINATED'
FACET databricksJobRunName
NOTE: Make sure to use the condition `databricksJobRunState = 'TERMINATED'` or `databricksJobRunTaskState = 'TERMINATED'` when visualizing or alerting on values using aggregator functions other than `latest`.
Average task run queued duration by task name over time
FROM Metric
SELECT average(databricks.job.run.task.duration.queue)
WHERE databricksJobRunTaskState = 'TERMINATED'
FACET databricksJobRunTaskName
TIMESERIES
NOTE: Make sure to use the condition `databricksJobRunState = 'TERMINATED'` or `databricksJobRunTaskState = 'TERMINATED'` when visualizing or alerting on values using aggregator functions other than `latest`.
A sample dashboard is included that shows examples of the types of job run information that can be displayed and the NRQL statements to use to visualize the data.
While not strictly enforced, the basic preferred editor settings are set in the .editorconfig. Other than this, no style guidelines are currently imposed.
This project uses both `go vet` and `staticcheck` to perform static code analysis. These checks are run via `precommit` on all commits. Though this can be bypassed on local commit, both checks are also run during the `validate` workflow and must pass without errors before a change can be merged.
Commit messages must follow the conventional commit format.
Again, while this can be bypassed on local commit, it is strictly enforced in
the validate
workflow.
The basic commit message structure is as follows.
<type>[optional scope][!]: <description>
[optional body]
[optional footer(s)]
In addition to providing consistency, the commit message is used by
svu during
the release workflow. The presence and values
of certain elements within the commit message affect auto-versioning. For
example, the feat
type will bump the minor version. Therefore, it is important
to use the guidelines below and carefully consider the content of the commit
message.
Please use one of the types below.
- `feat` (bumps minor version)
- `fix` (bumps patch version)
- `chore`
- `build`
- `docs`
- `test`
Any type can be followed by the !
character to indicate a breaking change.
Additionally, any commit that has the text BREAKING CHANGE:
in the footer will
indicate a breaking change.
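To make the rules above concrete, here is a small Go sketch that classifies a commit message by the version bump it would imply. It is not how `svu` is implemented. The `bumpFor` function and `headerPattern` are hypothetical, the scope is assumed to be written in parentheses as in the Conventional Commits specification, and mapping a breaking change to a major bump is an assumption that this document does not state.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// headerPattern matches "<type>[optional scope][!]: <description>".
// The scope is assumed to be parenthesized, e.g. "fix(api): ...".
var headerPattern = regexp.MustCompile(`^([a-z]+)(\([^)]+\))?(!)?: .+`)

// bumpFor returns "major", "minor", "patch", "none", or "invalid" for a
// commit message. Treating a breaking change as "major" is an assumption.
func bumpFor(message string) string {
	lines := strings.Split(message, "\n")
	m := headerPattern.FindStringSubmatch(lines[0])
	if m == nil {
		return "invalid"
	}
	// A "!" after the type or scope indicates a breaking change.
	if m[3] == "!" {
		return "major"
	}
	// Approximation: any later line starting with "BREAKING CHANGE:" is
	// treated as a breaking-change footer.
	for _, line := range lines[1:] {
		if strings.HasPrefix(line, "BREAKING CHANGE:") {
			return "major"
		}
	}
	switch m[1] {
	case "feat":
		return "minor"
	case "fix":
		return "patch"
	default:
		return "none"
	}
}

func main() {
	for _, msg := range []string{
		"feat: add job run metrics",
		"fix(spark): handle empty executor list",
		"chore!: drop deprecated config key",
	} {
		fmt.Printf("%q -> %s\n", msg, bumpFor(msg))
	}
}
```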
For local development, simply use `go build` and `go run`. For example,
go build cmd/databricks/databricks.go
Or
go run cmd/databricks/databricks.go
If you prefer, you can also use goreleaser
with
the --single-target
option to build the binary for the local GOOS
and
GOARCH
only.
goreleaser build --single-target
Releases are built and packaged using goreleaser
.
By default, a new release will be built automatically on any push to the main
branch. For more details, review the .goreleaser.yaml
and the goreleaser
documentation.
The svu utility is used to generate the next tag value based on commit messages.
This project utilizes GitHub workflows to perform actions in response to certain GitHub events.
| Workflow | Events | Description |
| --- | --- | --- |
| validate | `push`, `pull_request` to `main` branch | Runs precommit to perform static analysis and runs commitlint to validate the last commit message |
| build | `push`, `pull_request` | Builds and tests code |
| release | `push` to `main` branch | Generates a new tag using svu and runs `goreleaser` |
| repolinter | `pull_request` | Enforces repository content guidelines |
The sections below cover topics that are related to Databricks telemetry but that are not specifically part of this integration. In particular, any assets referenced in these sections must be installed and/or managed separately from the integration. For example, the init scripts provided to monitor cluster health are not automatically installed or used by the integration.
New Relic Infrastructure can be used to collect system metrics like CPU and memory usage from the nodes in a Databricks cluster. Additionally, New Relic APM can be used to collect application metrics like JVM heap size and GC cycle count from the Apache Spark driver and executor JVMs. Both are achieved using cluster-scoped init scripts. The sections below cover the installation of these init scripts.
NOTE: Use of one or both init scripts will have a slight impact on cluster startup time. Therefore, consideration should be given when using the init scripts with a job cluster, particularly when using a job cluster scoped to a single task.
Both the New Relic Infrastructure Agent init script and the New Relic APM Java Agent init script require a New Relic license key to be specified in a custom environment variable named `NEW_RELIC_LICENSE_KEY`. While the license key can be specified by hard-coding it in plain text in the compute configuration, this is not recommended. Instead, it is recommended to create a secret using the Databricks CLI and reference the secret in the environment variable.
To create the secret and set the environment variable, perform the following steps.
- Follow the steps to install or update the Databricks CLI.
- Use the Databricks CLI to create a Databricks-backed secret scope with the name `newrelic`. For example, `databricks secrets create-scope newrelic`

  NOTE: Be sure to take note of the information in the referenced URL about the `MANAGE` scope permission and use the correct version of the command.
- Use the Databricks CLI to create a secret for the license key in the new scope with the name `licenseKey`. For example, `databricks secrets put-secret --json '{ "scope": "newrelic", "key": "licenseKey", "string_value": "[YOUR_LICENSE_KEY]" }'`
To set the custom environment variable named NEW_RELIC_LICENSE_KEY
and
reference the value from the secret, follow the steps to
configure custom environment variables
and add the following line after the last entry in the Environment variables
field.
NEW_RELIC_LICENSE_KEY={{secrets/newrelic/licenseKey}}
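If these steps are scripted, building the JSON payload and the secret reference programmatically avoids quoting mistakes. The Go sketch below is a convenience illustration only: `putSecretPayload` and `secretEnvVar` are hypothetical helpers, and the code does not call the Databricks CLI or API. It only produces the strings shown in the steps above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// putSecretPayload returns the JSON document passed to
// `databricks secrets put-secret --json` in the step above.
func putSecretPayload(scope, key, value string) (string, error) {
	payload := struct {
		Scope       string `json:"scope"`
		Key         string `json:"key"`
		StringValue string `json:"string_value"`
	}{scope, key, value}
	b, err := json.Marshal(payload)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

// secretEnvVar returns a line for the Environment variables field that
// references a secret instead of embedding its value.
func secretEnvVar(name, scope, key string) string {
	return fmt.Sprintf("%s={{secrets/%s/%s}}", name, scope, key)
}

func main() {
	// "[YOUR_LICENSE_KEY]" is the placeholder used in this document.
	payload, err := putSecretPayload("newrelic", "licenseKey", "[YOUR_LICENSE_KEY]")
	if err != nil {
		panic(err)
	}
	fmt.Println(payload)
	fmt.Println(secretEnvVar("NEW_RELIC_LICENSE_KEY", "newrelic", "licenseKey"))
}
```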
The cluster_init_infra.sh
script
automatically installs the latest version of the
New Relic Infrastructure Agent
on each node of the cluster.
To install the init script, perform the following steps.
- Log in to your Databricks account and navigate to the desired workspace.
- Follow the recommendations for init scripts to store the `cluster_init_infra.sh` script within your workspace in the recommended manner. For example, if your workspace is enabled for Unity Catalog, you should store the init script in a Unity Catalog volume.
- Navigate to the `Compute` tab and select the desired all-purpose or job compute to open the compute details UI.
- Click the button labeled `Edit` to edit the compute's configuration.
- Follow the steps to use the UI to configure a cluster to run an init script and point to the location where you stored the init script in step 2.
- If your cluster is not running, click on the button labeled `Confirm` to save your changes. Then, restart the cluster. If your cluster is already running, click on the button labeled `Confirm and restart` to save your changes and restart the cluster.
The cluster_init_apm.sh
script
automatically installs the latest version of the
New Relic APM Java Agent
on each node of the cluster.
To install the init script, perform the same steps as outlined in the
Install the New Relic Infrastructure Agent init script
section using the cluster_init_apm.sh
script
instead of the cluster_init_infra.sh
script.
Additionally, perform the following steps.
- Log in to your Databricks account and navigate to the desired workspace.
- Navigate to the `Compute` tab and select the desired all-purpose or job compute to open the compute details UI.
- Click the button labeled `Edit` to edit the compute's configuration.
- Follow the steps to configure custom Spark configuration properties and add the following lines after the last entry in the `Spark Config` field.

      spark.driver.extraJavaOptions -javaagent:/databricks/jars/newrelic-agent.jar
      spark.executor.extraJavaOptions -javaagent:/databricks/jars/newrelic-agent.jar -Dnewrelic.tempdir=/tmp
- If your cluster is not running, click on the button labeled `Confirm` to save your changes. Then, restart the cluster. If your cluster is already running, click on the button labeled `Confirm and restart` to save your changes and restart the cluster.
With the New Relic Infrastructure Agent init script installed, a host entity will show up for each node in the cluster.
With the New Relic APM Java Agent init script installed, an APM application
entity named Databricks Driver
will show up for the Spark driver JVM and an
APM application entity named Databricks Executor
will show up for the
executor JVMs. Note that all executor JVMs will report to a single APM
application entity. Metrics for a specific executor can be viewed on many pages
of the APM UI
by selecting the instance from the Instances
menu located below the time range
selector. On the JVM Metrics page,
the JVM metrics for a specific executor can be viewed by selecting an instance
from the JVM instances
table.
Additionally, both the host entities and the APM entities are tagged with the tags listed below to make it easy to filter down to the entities that make up your cluster using the entity filter bar that is available in many places in the UI.
- `databricksClusterId` - The ID of the Databricks cluster
- `databricksClusterName` - The name of the Databricks cluster
- `databricksIsDriverNode` - `true` if the entity is on the driver node, otherwise `false`
- `databricksIsJobCluster` - `true` if the entity is part of a job cluster, otherwise `false`
Below is an example of using the databricksClusterName
to filter down to the
host and APM entities for a single cluster using the entity filter bar
on the All entities
view.
New Relic has open-sourced this project. This project is provided AS-IS WITHOUT WARRANTY OR DEDICATED SUPPORT. Issues and contributions should be reported to the project here on GitHub.
We encourage you to bring your experiences and questions to the Explorers Hub where our community members collaborate on solutions and new ideas.
At New Relic we take your privacy and the security of your information seriously, and are committed to protecting your information. We must emphasize the importance of not sharing personal data in public forums, and ask all users to scrub logs and diagnostic information for sensitive information, whether personal, proprietary, or otherwise.
We define “Personal Data” as any information relating to an identified or identifiable individual, including, for example, your name, phone number, post code or zip code, Device ID, IP address, and email address.
For more information, review New Relic’s General Data Privacy Notice.
We encourage your contributions to improve this project! Keep in mind that when you submit your pull request, you'll need to sign the CLA via the click-through using CLA-Assistant. You only have to sign the CLA one time per project.
If you have any questions, or to execute our corporate CLA (which is required if your contribution is on behalf of a company), drop us an email at opensource@newrelic.com.
A note about vulnerabilities
As noted in our security policy, New Relic is committed to the privacy and security of our customers and their data. We believe that providing coordinated disclosure by security researchers and engaging with the security community are important means to achieve our security goals.
If you believe you have found a security vulnerability in this project or any of New Relic's products or websites, we welcome and greatly appreciate you reporting it to New Relic through HackerOne.
If you would like to contribute to this project, review these guidelines.
To all contributors, we thank you! Without your contribution, this project would not be what it is today.
The New Relic Databricks Integration project is licensed under the Apache 2.0 License.