add README on GCP workload monitoring feature
jcyang43 committed Jan 8, 2025
1 parent cf65bcb commit caef832
Showing 2 changed files with 21 additions and 2 deletions.
4 changes: 2 additions & 2 deletions MaxText/configs/base.yml
@@ -460,8 +460,8 @@ goodput_upload_interval_seconds: 60
enable_pathways_goodput: False

# GCP workload monitoring
-report_heartbeat_metric_for_gcp_monitoring: False
-report_performance_metric_for_gcp_monitoring: False
+report_heartbeat_metric_for_gcp_monitoring: True
+report_performance_metric_for_gcp_monitoring: True

# Vertex AI Tensorboard Configurations - https://github.com/google/maxtext/tree/main/getting_started/Use_Vertex_AI_Tensorboard.md
# Set to True for GCE, False if running via XPK
19 changes: 19 additions & 0 deletions getting_started/GCP_Workload_Monitoring.md
@@ -0,0 +1,19 @@
# Enable GCP Workload Monitoring
This guide provides an overview of how to enable GCP workload monitoring for your MaxText workload.

## Overview
Google offers a monitoring and alerting feature that is well suited for critical MaxText workloads sensitive to infrastructure changes.
Once enabled, metrics will be automatically sent to [Cloud Monarch](https://research.google/pubs/monarch-googles-planet-scale-in-memory-time-series-database/) for monitoring.
If a metric hits its pre-defined threshold, the Google Cloud on-call team will be alerted to see if any action is needed.

The feature currently supports two metrics: heartbeat and performance (training step time in seconds). Support for the goodput metric is planned for the near future.
Users should work with their Customer Engineer (CE) and the Google team to define appropriate thresholds for the performance metrics.

This guide lays out how to enable the feature for your MaxText workload.

## Enabling GCP Workload Monitoring
Users can control which metrics to report via config:
- To report the heartbeat metric, set `report_heartbeat_metric_for_gcp_monitoring` to `True`
- To report the performance metric (training step time in seconds), set `report_performance_metric_for_gcp_monitoring` to `True`

For an example, please refer to [base.yml](../MaxText/configs/base.yml).
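
As a minimal sketch, the relevant fragment of a config with both metrics enabled might look like the following. The two keys are the ones named above; the trailing comments are illustrative additions and are not part of `base.yml`. Set either value to `False` to stop reporting that metric.

```yaml
# GCP workload monitoring
report_heartbeat_metric_for_gcp_monitoring: True    # report the heartbeat metric
report_performance_metric_for_gcp_monitoring: True  # report training step time in seconds
```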
