docs/revamped status #462

Merged
merged 12 commits into from
Feb 14, 2024
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -14,6 +14,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- This has options to filter by return code, task queues, task statuses, and workers
- You can set a limit on the number of tasks to display
- There are 3 options to modify the output display
- Docs for all of the monitoring commands
- New file `merlin/study/status.py` dedicated to work relating to the status command
- Contains the Status and DetailedStatus classes
- New file `merlin/study/status_renderers.py` dedicated to formatting the output for the detailed-status command
2 changes: 1 addition & 1 deletion README.md
@@ -62,7 +62,7 @@ HPC batch systems, since it can scale to a very large number of jobs.

The integrated system looks a little something like this:

<img src="docs/images/merlin_arch.png" alt="a typical Merlin workflow">
![A Typical Merlin Workflow](docs/assets/images/merlin_arch.png)

In this example, here's how it all works:

245 changes: 234 additions & 11 deletions docs/user_guide/command_line.md
@@ -539,18 +539,133 @@ merlin stop-workers [OPTIONS]

## Monitoring Commands

The Merlin library comes equipped with commands to help monitor your workflow:
The Merlin library comes equipped with several commands to help monitor your workflow:

- *[detailed-status](#detailed-status-merlin-detailed-status)*: Display task-by-task status information for a study
- *[monitor](#monitor-merlin-monitor)*: Keep your allocation alive while tasks are being processed
- *[query-workers](#query-workers-merlin-query-workers)*: Communicate with Celery to view information on active workers
- *[status](#status-merlin-status)*: Communicate with Celery to view the status of queues in your workflow(s)
- *[queue-info](#queue-info-merlin-queue-info)*: Communicate with Celery to view the status of queues in your workflow(s)
- *[status](#status-merlin-status)*: Display a summary of the status of a study

More information on all of these commands can be found below and in the [Monitoring documentation](./monitoring/index.md).

### Detailed Status (`merlin detailed-status`)

!!! warning

For the pager opened by this command to work properly, the `MANPAGER` or `PAGER` environment variable must be set to `less -r`. This can be set with:

=== "MANPAGER"

```bash
export MANPAGER="less -r"
```

=== "PAGER"

```bash
export PAGER="less -r"
```

Display the task-by-task status of a workflow.

This command will open a pager window with task statuses. Inside this pager window, you can search and scroll through task statuses for every step of your workflow.

For more information, see the [Detailed Status documentation](./monitoring/status_cmds.md#the-detailed-status-command).

**Usage:**

```bash
merlin detailed-status [OPTIONS] WORKSPACE_OR_SPECIFICATION
```

**Options:**

| Name | Type | Description | Default |
| ------------ | ------- | ----------- | ------- |
| `-h`, `--help` | boolean | Show this help message and exit | `False` |
| `--dump` | filename | The name of a csv or json file to dump the status to | None |
| `--task_server` | string | Task server type. Currently only "celery" is implemented. | "celery" |
| `-o`, `--output-path` | dirname | Specify a location to look for output workspaces. Only used when a spec file is passed as the argument to `status`. | None |

**Filter Options:**

The `detailed-status` command comes equipped with several options to help filter the output of your status query.

| Name | Type | Description | Default |
| ------------ | ------- | ----------- | ------- |
| `--max-tasks` | integer | Sets a limit on how many tasks can be displayed. | None |
| `--return-code` | List[string] | Filter which tasks to display based on their return code. Multiple return codes can be provided using a space-delimited list. Options: `SUCCESS`, `SOFT_FAIL`, `HARD_FAIL`, `STOP_WORKERS`, `RETRY`, `DRY_SUCCESS`, `UNRECOGNIZED`. | None |
| `--steps` | List[string] | Filter which tasks to display based on the steps that they're associated with. Multiple steps can be provided using a space-delimited list. | `['all']` |
| `--task-queues` | List[string] | Filter which tasks to display based on the task queues that they were/are in. Multiple task queues can be provided using a space-delimited list. | None |
| `--task-status` | List[string] | Filter which tasks to display based on their status. Multiple statuses can be provided using a space-delimited list. Options: `INITIALIZED`, `RUNNING`, `FINISHED`, `FAILED`, `CANCELLED`, `DRY_RUN`, `UNKNOWN`. | None |
| `--workers` | List[string] | Filter which tasks to display based on which workers are processing them. Multiple workers can be provided using a space-delimited list. | None |

**Display Options:**

There are multiple options to modify the way task statuses are displayed.

| Name | Type | Description | Default |
| ------------ | ------- | ----------- | ------- |
| `--disable-pager` | boolean | Turn off the pager functionality when viewing the task-by-task status. **Caution:** This option is *not* recommended for large workflows as you could freeze your terminal with thousands of task statuses. | `False` |
| `--disable-theme` | boolean | Turn off styling for the status layout. | `False` |
| `--layout` | string | Alternate task-by-task status display layouts. Options: `table`, `default`. | `default` |
| `--no-prompts` | boolean | Ignore any prompts provided. This causes the `detailed-status` command to default to the latest study if you provide a spec file as input. | `False` |

**Examples:**

!!! example "Check the Detailed Status Using Workspace as Input"

```bash
merlin detailed-status study_name_20240129-123452/
```

!!! example "Check the Detailed Status Using a Specification as Input"

This will look in the `OUTPUT_PATH` [Reserved Variable](./variables.md#reserved-variables) defined within the spec file to try to find existing workspace directories associated with this spec file. If more than one are found, a prompt will be displayed for you to select a workspace directory.

```bash
merlin detailed-status my_specification.yaml
```

!!! example "Dump the Status Report to a JSON File"

```bash
merlin detailed-status study_name_20240129-123452/ --dump status_report.json
```

!!! example "Only Display Failed Tasks"

```bash
merlin detailed-status study_name_20240129-123452/ --task-status FAILED
```

!!! example "Display the First 8 Successful Tasks"

```bash
merlin detailed-status study_name_20240129-123452/ --return-code SUCCESS --max-tasks 8
```

!!! example "Disable the Theme"

```bash
merlin detailed-status study_name_20240129-123452/ --disable-theme
```

!!! example "Use the Table Layout"

```bash
merlin detailed-status study_name_20240129-123452/ --layout table
```
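
The filter options above can also be assembled programmatically, for example from a batch script that checks on a study. The following is a minimal sketch, not part of Merlin itself: the helper names `build_detailed_status_cmd`, `pager_env`, and `run_detailed_status` are hypothetical, and only flags listed in the tables above are used. It assumes the `merlin` executable is on your `PATH`.

```python
import os
import shutil
import subprocess
from typing import Dict, List, Mapping, Optional, Sequence


def build_detailed_status_cmd(
    target: str,
    task_status: Sequence[str] = (),
    return_code: Sequence[str] = (),
    steps: Sequence[str] = (),
    max_tasks: Optional[int] = None,
    dump: Optional[str] = None,
) -> List[str]:
    """Build the argument list for `merlin detailed-status` (hypothetical helper)."""
    cmd = ["merlin", "detailed-status", target]
    # List-valued filters are passed as space-delimited values after the flag.
    for flag, values in (
        ("--task-status", task_status),
        ("--return-code", return_code),
        ("--steps", steps),
    ):
        if values:
            cmd += [flag, *values]
    if max_tasks is not None:
        cmd += ["--max-tasks", str(max_tasks)]
    if dump is not None:
        cmd += ["--dump", dump]
    return cmd


def pager_env(base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Copy the environment and set PAGER to `less -r`, as the warning above requires."""
    env = dict(os.environ if base is None else base)
    env["PAGER"] = "less -r"
    return env


def run_detailed_status(cmd: Sequence[str]) -> int:
    """Run the command if merlin is installed and return its exit code."""
    if shutil.which("merlin") is None:
        raise FileNotFoundError("merlin executable not found on PATH")
    return subprocess.run(list(cmd), env=pager_env(), check=False).returncode
```

For instance, `build_detailed_status_cmd("study_name_20240129-123452/", return_code=["SUCCESS"], max_tasks=8)` produces the same command as the "Display the First 8 Successful Tasks" example.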

### Monitor (`merlin monitor`)

Batch submission scripts may not keep the batch allocation alive if there is no blocking process in the submission script. The `merlin monitor` command addresses this by providing a blocking process that checks for tasks in the queues every `sleep` seconds, where `sleep` is set with the `--sleep` option. When the queues are empty, the monitor will query Celery to see if any workers are still processing tasks from the queues. If the queues are empty and no workers are processing tasks from them, the blocking process will exit and allow the allocation to end.

The `monitor` functionality will check for Celery workers for up to 10*(sleep) seconds before monitoring begins. This check loop runs when the queue(s) in the spec contain tasks but no running workers are detected, which protects against a failed worker launch.

For more information, see the [Monitoring Studies for Persistent Allocations documentation](./monitoring/monitor_for_allocation.md).
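
The control flow described above can be pictured with a short sketch. This is an illustration under stated assumptions, not Merlin's actual implementation: the function `monitor` and the three callables passed to it are hypothetical stand-ins for the queue and worker queries that Merlin sends to Celery, and what Merlin does when no workers ever appear is assumed here to be "stop blocking".

```python
import time
from typing import Callable


def monitor(
    queues_have_tasks: Callable[[], bool],
    workers_running: Callable[[], bool],
    workers_busy: Callable[[], bool],
    sleep: float = 60.0,
    max_worker_checks: int = 10,
    wait: Callable[[float], None] = time.sleep,
) -> str:
    """Block while work remains; return why blocking stopped (hypothetical sketch)."""
    # Startup guard: tasks are queued but no worker is running yet.
    # Wait up to `max_worker_checks` times, i.e. about 10 * sleep seconds.
    checks = 0
    while queues_have_tasks() and not workers_running():
        if checks >= max_worker_checks:
            # Assumed behaviour: give up, since the worker launch likely failed.
            return "no-workers"
        wait(sleep)
        checks += 1

    # Main loop: keep the allocation alive while tasks are queued
    # or workers are still processing tasks from the queues.
    while queues_have_tasks() or workers_busy():
        wait(sleep)
    return "finished"
```

Passing the queries in as callables keeps the sketch independent of Celery and makes the two exit conditions easy to exercise with fake functions.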

**Usage:**

```bash
@@ -591,6 +706,8 @@ Check which workers are currently connected to the task server.

This will broadcast a command to all connected workers and print the names of any that respond and the queues they're attached to. This is useful for interacting with workers, such as via `merlin stop-workers --workers`.

For more information, see the [Query Workers documentation](./monitoring/queues_and_workers.md#query-workers).

**Usage:**

```bash
@@ -643,42 +760,148 @@ merlin query-workers [OPTIONS]
merlin query-workers --workers ^step
```

### Status (`merlin status`)
### Queue Info (`merlin queue-info`)

!!! note

Prior to Merlin v1.12.0, the `merlin status` command would produce the same output as `merlin queue-info --spec <spec_file>`.

Check the status of queues to see if there are any tasks in them and/or any workers watching them.

Check the status of the queues in your spec file to see if there are any tasks in them and any active workers watching them.
If used without the `--spec` option, this will query any active queues. Active queues are queues that have a worker watching them.

For more information, see the [Queue Information documentation](./monitoring/queues_and_workers.md#queue-information).

**Usage:**

```bash
merlin status [OPTIONS] SPECIFICATION
merlin queue-info [OPTIONS]
```

**Options:**

| Name | Type | Description | Default |
| ------------ | ------- | ----------- | ------- |
| `-h`, `--help` | boolean | Show this help message and exit | `False` |
| `--dump` | filename | The name of a csv or json file to dump the queue information to | None |
| `--specific-queues` | List[string] | A space-delimited list of queues to get information on | None |
| `--task_server` | string | Task server type. Currently only "celery" is implemented. | "celery" |

**Specification Options:**

These options all *must* be used with the `--spec` option if used.

| Name | Type | Description | Default |
| ------------ | ------- | ----------- | ------- |
| `--spec` | filename | Query for the queues named in each step of the spec file given here | None |
| `--steps` | List[string] | A space-delimited list of steps in the input spec that you want to query. Should be given after the input spec. | `['all']` |
| `--vars` | List[string] | A space-delimited list of variables to override in the spec file. This list should be given after the spec file is provided. Ex: `--vars QUEUE_NAME=new_queue_name` | None |

**Examples:**

!!! example "Query All Active Queues"

```bash
merlin queue-info
```

!!! example "Check the Status of Specific Queues"

```bash
merlin queue-info --specific-queues queue_1 queue_3
```

!!! example "Check the Status of Queues in a Spec File"

**This is the same as running `merlin status <spec_file>` prior to Merlin v1.12.0**

```bash
merlin queue-info --spec my_specification.yaml
```

!!! example "Check the Status of Queues for Specific Steps"

```bash
merlin queue-info --spec my_specification.yaml --steps step_1 step_3
```

!!! example "Dump the Queue Information to a JSON File"

```bash
merlin queue-info --dump queue_report.json
```

### Status (`merlin status`)

!!! note

To obtain the same functionality as the `merlin status` command prior to Merlin v1.12.0, use [`merlin queue-info`](#queue-info-merlin-queue-info) with the `--spec` option:

```bash
merlin queue-info --spec <spec_file>
```

Display a high-level status summary of a workflow.

This will display the progress of each step in your workflow using progress bars and brief summaries. In each summary you can find how many tasks there are in total for a step, how many tasks are in each state, the average run time and standard deviation of run times of the tasks in the step, the task queue, and the worker that is watching the step.

For more information, see the [Status documentation](./monitoring/status_cmds.md#the-status-command).

**Usage:**

```bash
merlin status [OPTIONS] WORKSPACE_OR_SPECIFICATION
```

**Options:**

| Name | Type | Description | Default |
| ------------ | ------- | ----------- | ------- |
| `-h`, `--help` | boolean | Show this help message and exit | `False` |
| `--cb-help` | boolean | Colorblind help option. This will utilize different symbols for each state of a task. | `False` |
| `--dump` | filename | The name of a csv or json file to dump the status to | None |
| `--no-prompts` | boolean | Ignore any prompts provided to the command line. This will default to the latest study if you provide a spec file rather than a study workspace. | `False` |
| `--task_server` | string | Task server type. Currently only "celery" is implemented. | "celery" |
| `--csv` | filename | The name of a csv file to dump the queue status report to | None |
| `-o`, `--output-path` | dirname | Specify a location to look for output workspaces. Only used when a spec file is passed as the argument to `status`. | None |

**Examples:**

!!! example "Basic Status Check"
!!! example "Check the Status Using Workspace as Input"

```bash
merlin status study_name_20240129-123452/
```

!!! example "Check the Status Using a Specification as Input"

This will look in the `OUTPUT_PATH` [Reserved Variable](./variables.md#reserved-variables) defined within the spec file to try to find existing workspace directories associated with this spec file. If more than one are found, a prompt will be displayed for you to select a workspace directory.

```bash
merlin status my_specification.yaml
```

!!! example "Check the Status of Queues for Certain Steps"
!!! example "Check the Status Using a Specification as Input & Ignore Any Prompts"

If multiple workspace directories associated with the provided spec file are found, the `--no-prompts` option will skip the prompt and select the most recent study that was run, based on the timestamps.

```bash
merlin status my_specification.yaml --no-prompts
```
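
To make "most recent study based on the timestamps" concrete, here is a rough sketch of one way such a selection could work. It is not Merlin's code: the function `latest_workspace` is hypothetical, and the `YYYYMMDD-HHMMSS` suffix format is inferred from the workspace names used in the examples on this page.

```python
import re
from datetime import datetime
from typing import Iterable, Optional

# Assumed suffix format, inferred from names like study_name_20240129-123452
_STAMP = re.compile(r"_(\d{8}-\d{6})/?$")


def latest_workspace(names: Iterable[str]) -> Optional[str]:
    """Return the name with the newest timestamp suffix, or None if none match."""
    stamped = []
    for name in names:
        match = _STAMP.search(name)
        if not match:
            continue  # not a timestamped workspace directory
        try:
            stamp = datetime.strptime(match.group(1), "%Y%m%d-%H%M%S")
        except ValueError:
            continue  # digits that do not form a valid date/time
        stamped.append((stamp, name))
    return max(stamped)[1] if stamped else None
```

Given `study_name_20240129-123452` and `study_name_20240130-081500`, this sketch picks the second, which is the choice you would otherwise make by hand at the prompt.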

!!! example "Dump the Status Report to a CSV File"

```bash
merlin status study_name_20240129-123452/ --dump status_report.csv
```

!!! example "Look For Workspaces at a Certain Location"

```bash
merlin status my_specification.yaml --steps step_1 step_3
merlin status my_specification.yaml -o new_output_path/
```

!!! example "Dump the Status to a CSV File"
!!! example "Utilize the Colorblind Functionality"

```bash
merlin status my_specification.yaml --csv status_report.csv
merlin status study_name_20240129-123452/ --cb-help
```
33 changes: 33 additions & 0 deletions docs/user_guide/monitoring/index.md
@@ -0,0 +1,33 @@
# Monitoring Studies

This section of the documentation is dedicated to guiding you through the intricacies of monitoring studies with Merlin. From utilizing monitoring tools to interpreting their outputs, we'll explore how to leverage Merlin's monitoring features to enhance your study management experience.

## Key Objectives

1. **Real-Time Visibility**

- Gain instant insights into the progress of your studies.

- Monitor the status of individual tasks and their dependencies.

2. **Issue Identification and Resolution**

- Identify and address issues or bottlenecks in study execution promptly.

- Utilize monitoring data for efficient troubleshooting.

3. **Performance Optimization**

- Analyze historical data to identify patterns and optimize study workflows.

- Fine-tune parameters based on monitoring insights for enhanced efficiency.

## What is in This Section?

There are several commands used specifically for monitoring studies (see [Monitoring Commands](../command_line.md#monitoring-commands)). Throughout this section we'll discuss each and every one in further detail:

- [The Status Commands](./status_cmds.md): As you may have guessed, this module will cover the two status commands that Merlin provides ([`merlin status`](../command_line.md#status-merlin-status) and [`merlin detailed-status`](../command_line.md#detailed-status-merlin-detailed-status))

- [Querying Queues and Workers](./queues_and_workers.md): This module will discuss how queues and workers can be queried with the [`merlin queue-info`](../command_line.md#queue-info-merlin-queue-info) and the [`merlin query-workers`](../command_line.md#query-workers-merlin-query-workers) commands.

- [Monitoring Studies for Persistent Allocations](./monitor_for_allocation.md): Here we'll discuss how allocations can be kept alive using the [`merlin monitor`](../command_line.md#monitor-merlin-monitor) command.