Skip to content

Conversation

felipemello1
Copy link
Contributor

@felipemello1 felipemello1 commented Oct 8, 2025

When logging per rank, we need good naming to make it easier to debug. Now provisioner.py::get_proc_mesh provides a mesh_name (#378), so i am just using it.

User can also pass a process name as input. Thats how we get the Controller name.

mlogger = await get_or_create_metric_logger(process_name="Controller")

The process_name then goes: LocalFetcherActor -> MetricCollector -> backend.init --> wandb.init(name)

image

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Oct 8, 2025
@allenwang28
Copy link
Contributor

could you add some more comments in the description about what this PR is enabling?

@felipemello1
Copy link
Contributor Author

could you add some more comments in the description about what this PR is enabling?

yes! sorry. I should have marked as a draft. I am doing a 2.5/4.0 before i ask you to review it

@felipemello1 felipemello1 marked this pull request as draft October 8, 2025 22:37
@felipemello1 felipemello1 changed the title Metric Logging updates 3/4 [wip] Metric Logging updates 3/4 Oct 8, 2025
@felipemello1 felipemello1 changed the title [wip] Metric Logging updates 3/4 Metric Logging updates 4/N Oct 9, 2025
self.timestamp = datetime.now(pytz.UTC).timestamp()


def get_actor_name_with_rank() -> str:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved to observability/utils.py

@felipemello1 felipemello1 marked this pull request as ready for review October 9, 2025 03:27
Comment on lines 81 to 82
if process_name is None:
process_name = detect_actor_name_from_call_stack()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get name here and pass it to

local_fetcher_actor = proc.spawn(
            "local_fetcher_actor", LocalFetcherActor, global_logger, process_name
        )

this function is called in provisioner.py, and thats how we get the process_name for every wandb run

logger = logging.getLogger(__name__)


def detect_actor_name_from_call_stack() -> str:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

main file to review

@felipemello1
Copy link
Contributor Author

felipemello1 commented Oct 10, 2025

in the near future the mesh might hold a name. When this happens, we can delete the function to use the call stack and just get it from the mesh. The rest of the PR stands.

@felipemello1 felipemello1 changed the title Metric Logging updates 4/N Metric Logging updates 4/N - better actor name Oct 14, 2025
@felipemello1 felipemello1 marked this pull request as draft October 14, 2025 15:31
@felipemello1 felipemello1 marked this pull request as ready for review October 14, 2025 20:06
Copy link
Contributor

@ebsmothers ebsmothers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a few small comments but stamping to unblock

# Spawn services first (triggers registrations via provisioner hook)
trainer = await TrainActor.options(**service_config).as_service()
generator = await GeneratorActor.options(**service_config).as_service()
trainer = await TrainActor.options(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dumb q but what are we doing with this toy_rl stuff anyways? (Like is there some reason we're still keeping it around?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i still find it useful, e.g. i dont need to have gpus available to test it. Eventually this should become an integration test.

Timestamp is automatically set to current EST time if not provided.
Args:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: give an actual docstring here (otherwise i can just read this info 5 lines below)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

Comment on lines +100 to +102
if process_name is None:
ctx = context()
process_name = ctx.actor_instance.actor_id.actor_name
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a small thing but it's not immediately clear why we do this here vs get_proc_name_with_rank in other places. (after looking at the code i think it's a global vs local thing, but imo this could be more clearly documented)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_proc_name_with_rank returns it with replica id and rank. Here i just want the process name, so i would have to parse it :/

@codecov-commenter
Copy link

Codecov Report

❌ Patch coverage is 92.20779% with 12 lines in your changes missing coverage. Please review.
✅ Project coverage is 65.14%. Comparing base (4c14792) to head (e901ad5).
⚠️ Report is 7 commits behind head on main.

Files with missing lines Patch % Lines
src/forge/observability/metric_actors.py 57.14% 3 Missing ⚠️
src/forge/observability/metrics.py 62.50% 3 Missing ⚠️
tests/unit_tests/observability/conftest.py 50.00% 2 Missing ⚠️
tests/unit_tests/observability/test_utils.py 93.54% 2 Missing ⚠️
src/forge/controller/provisioner.py 0.00% 1 Missing ⚠️
src/forge/observability/utils.py 94.73% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #351      +/-   ##
==========================================
- Coverage   73.68%   65.14%   -8.54%     
==========================================
  Files          81       82       +1     
  Lines        7729     7850     +121     
==========================================
- Hits         5695     5114     -581     
- Misses       2034     2736     +702     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@felipemello1 felipemello1 merged commit 1f45470 into meta-pytorch:main Oct 15, 2025
9 checks passed
allenwang28 pushed a commit to allenwang28/forge that referenced this pull request Oct 15, 2025
Co-authored-by: Felipe Mello <felipemello@fb.com>
allenwang28 added a commit to allenwang28/forge that referenced this pull request Oct 15, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Meta Open Source bot.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants