Do not log full worker info in retire_workers #8935

Merged
fjetter merged 3 commits into dask:main from less_verbose_retirement_logs on Nov 14, 2024

Conversation

@fjetter (Member) commented Nov 14, 2024

Whenever we retire a worker we're apparently logging the WorkerState.identity information to the internal event system.

I've seen individual events of this message grow to 150KiB and beyond for large clusters.

Apart from the unnecessary memory this eats up, it can cause issues when the message is ingested elsewhere. It can also become a substantial problem if anything tries to print the message, since that can easily overflow buffers and lock up the scheduler process.

While none of those ingestion/logging issues are the fault of this code itself, logging this information is entirely unnecessary and far too verbose.
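For reference, these events can be read back from a client, which is one way to see how large individual entries get. A minimal sketch, assuming the retirement events land on the "all" topic and that each entry is a (timestamp, msg) tuple; the scheduler address and the "action" key are illustrative, not taken from the code:

from distributed import Client

client = Client("tls://scheduler:8786")  # illustrative address

for ts, msg in client.get_events("all"):
    # Assumption: retirement events are dicts carrying an "action" field
    if isinstance(msg, dict) and msg.get("action") == "retire-workers":
        # Rough size estimate of a single event payload
        print(ts, len(repr(msg)), "characters")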

As an example of the information that's being logged:

{
    "tls://123.0.456.789:1234": {
        "type": "Worker",
        "id": "12345",
        "host": "123.0.456.789",
        "resources": {},
        "local_directory": "/scratch/dask-scratch-space/worker-foo",
        "name": "12345",
        "nthreads": 4,
        "memory_limit": 15923011584,
        "last_seen": 1731554289.3153517,
        "services": {"dashboard": 8788},
        "metrics": {
            "task_counts": {"memory": 45},
            "bandwidth": {"total": 100000000, "workers": {}, "types": {}},
            "digests_total_since_heartbeat": {
                "latency": 0.0008902549743652344,
                "tick-duration": 2.978916883468628,
                ("get-data", "memory-read", "count"): 24,
                ("get-data", "memory-read", "bytes"): 4943466,
                ("get-data", "serialize", "seconds"): 0.08324008999994703,
                ("get-data", "compress", "seconds"): 0.0023884560009150846,
                ("get-data", "network", "seconds"): 0.38302203345165253,
            },
            "managed_bytes": 9270953,
            "spilled_bytes": {"memory": 0, "disk": 0},
            "transfer": {
                "incoming_bytes": 0,
                "incoming_count": 0,
                "incoming_count_total": 0,
                "outgoing_bytes": 0,
                "outgoing_count": 0,
                "outgoing_count_total": 10,
            },
            "event_loop_interval": 0.02000030517578125,
            "cpu": 2.0,
            "memory": 847728640,
            "time": 1731554289.0746875,
            "host_net_io": {
                "read_bps": 2962.6441596327654,
                "write_bps": 2337.352438820186,
            },
            "host_disk_io": {"read_bps": 0.0, "write_bps": 0.0},
            "gil_contention": 0.0004726344777736813,
            "num_fds": 33,
        },
        "status": "closed",
        "nanny": "tls://123.0.456.789:1234",
    }
}

A few of those keys may be remotely interesting when debugging why certain workers are or aren't retiring, but I don't think logging the full payload is a feasible approach.
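To illustrate the direction of the change, here is a minimal sketch (not the actual diff) of an event payload that keeps only the worker addresses instead of the full identity dicts; the helper name and dict shapes are made up for this example:

def retirement_event_payload(workers_info):
    # Hypothetical helper: ``workers_info`` maps address -> full
    # WorkerState.identity() dict, as shown in the example above.
    # Keep only the addresses.
    return {
        "action": "retire-workers",
        "workers": list(workers_info),  # a few bytes per worker instead of KiBs
    }


payload = retirement_event_payload(
    {"tls://123.0.456.789:1234": {"type": "Worker", "id": "12345"}}
)
print(payload)
# {'action': 'retire-workers', 'workers': ['tls://123.0.456.789:1234']}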

github-actions bot (Contributor) commented Nov 14, 2024

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

25 files ±0   25 suites ±0   9h 28m 56s ⏱️ (-49m 30s)
4 132 tests +2:   4 014 ✅ -1   112 💤 +2   5 ❌ ±0   1 🔥 +1
47 330 runs -362:   45 211 ✅ -353   2 101 💤 -21   17 ❌ +11   1 🔥 +1

For more details on these failures and errors, see this check.

Results for commit 57291b7. ± Comparison against base commit 5bedd3f.

♻️ This comment has been updated with latest results.

fjetter merged commit d7eff77 into dask:main on Nov 14, 2024
25 of 29 checks passed
fjetter deleted the less_verbose_retirement_logs branch on November 14, 2024 at 13:41