
Conversation

@shuailong616 (Contributor)

Modify PR 389 to adapt the code for the new version of FlagScale
#389

@shuailong616 requested review from a team and aoyulong as code owners on October 29, 2025 08:43
@gemini-code-assist (Contributor)

Summary of Changes

Hello @shuailong616, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request delivers an initial implementation of a single-process simulator for the FlagScale framework. The simulator aims to facilitate the analysis of pipeline parallelism costs and memory consumption across various heterogeneous device configurations. By introducing a dummy distributed backend and adapting core training components, it allows for theoretical performance evaluation and optimal configuration generation in a non-distributed setting, which is crucial for efficient large-scale model training.

Highlights

  • Single-Process Simulator: Introduced a preliminary single-process simulator to analyze pipeline parallelism costs and memory usage for heterogeneous configurations without requiring a full distributed environment.
  • Dummy Distributed Backend: Added a custom C++ backend (dummy.hpp, dummy.cpp, setup.py) that provides mock implementations for PyTorch's c10d collective communication operations, enabling the simulator to run without actual distributed communication (see the usage sketch after this list).
  • Configuration Generation and Analysis: New Python scripts (analylize_pipeline_time.py, config_gen.py) were added to generate possible heterogeneous mesh configurations, split layers across pipeline stages, calculate theoretical peak memory usage, and simulate execution times to find optimal setups.
  • Integration with Core Training Logic: Modified parallel_context.py to bypass actual distributed all_gather_object calls and use the dummy backend when the simulator is enabled. train.py was updated to include timing for forward/backward passes and to run in forward_only mode during simulation, while theoretical_memory_usage.py and train_gpt.py were adjusted to handle simulator-specific arguments and disable NaN checks.
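For readers unfamiliar with custom c10d backends: the dummy backend described above follows PyTorch's pattern for process-group backends shipped as C++ extensions. A minimal, hypothetical usage sketch follows; the extension module name dummy_collectives is an assumption based on the file names in this PR and is not confirmed by the diff.

import os
import torch
import torch.distributed as dist

# Assumed: importing the compiled extension registers the "dummy" backend
# with torch.distributed (this is the pattern from PyTorch's custom-backend
# tutorial, not necessarily the exact layout used in this PR).
import dummy_collectives

os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="dummy", rank=0, world_size=1)

t = torch.ones(4)
dist.all_reduce(t)  # the dummy allreduce reviewed below zeroes the tensor instead of reducing
print(t)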
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review: /gemini review performs a code review for the current pull request in its current state.
  • Pull Request Summary: /gemini summary provides a summary of the current pull request in its current state.
  • Comment: @gemini-code-assist responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help: /gemini help displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the root of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in sharing feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check its output and use code with caution.

@gemini-code-assist (Contributor, bot) left a comment

Code Review

This pull request introduces a preliminary implementation of a single-process simulator for auto-tuning, which is a significant feature. The changes include a new dummy distributed backend, scripts for configuration generation and pipeline time analysis, and modifications to the training code to support simulation mode. My review has identified several issues, including two critical bugs that would either crash the program or break existing training functionality. I have also found a number of high-severity issues related to incorrect logic, security vulnerabilities, and maintainability problems like hardcoded paths and commands. I've provided detailed feedback and code suggestions to address these points. Overall, this is a good starting point, and addressing these comments will significantly improve the robustness and correctness of the simulator.

micro_batch_size=args.micro_batch_size,
decoder_seq_length=args.decoder_seq_length,
forward_only=False,
forward_only=True,

critical

Setting forward_only=True unconditionally will break the backward pass and prevent the model from training. This change is intended for the simulator to measure forward pass time. It must be guarded by a check for simulator mode, for example: forward_only=args.enable_simulator.

Suggested change
forward_only=True,
forward_only=args.enable_simulator,
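If the codebase does not already expose such a switch, here is a minimal sketch of how the flag could be registered; the flag name matches the suggestion above, but everything else is illustrative and not taken from FlagScale.

import argparse

def add_simulator_args(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
    group = parser.add_argument_group("simulator")
    # store_true defaults to False, so regular training keeps its backward pass.
    group.add_argument("--enable-simulator", dest="enable_simulator",
                       action="store_true",
                       help="Run in single-process simulation mode (forward only).")
    return parser

With that default, forward_only=args.enable_simulator is a no-op for normal training runs and only becomes True when the simulator is enabled.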

# os.environ["WORLD_SIZE"] = args.world_size
os.environ["WORLD_SIZE"] = "8"
# os.environ["WORLD_SIZE"] = "32"
rdav_endpoint = random.randint(0, 40000)

critical

There is a typo in the variable name rdav_endpoint. It should be rdzv_endpoint to match the environment variable RDZV_ENDPOINT set on the next line. This will cause a NameError.

Suggested change
rdav_endpoint = random.randint(0, 40000)
rdzv_endpoint = random.randint(0, 40000)

Comment on lines +118 to +119
else:
return False

high

The is_extreme_strategy function has a logic error. The else: return False statement is inside the loop, which will cause the function to return after checking only the first mesh. The return False should be moved outside the loop to ensure all meshes in the combination are checked.

            ):
                return True
        return False
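A compact sketch of the corrected control flow follows; the per-mesh predicate is a stand-in purely for illustration, since the real condition is not visible in this hunk.

def is_extreme_mesh(mesh):
    # Stand-in predicate for illustration only: call a mesh "extreme"
    # if any of its dimensions collapses to 1.
    return 1 in mesh

def is_extreme_strategy(meshes):
    for mesh in meshes:
        if is_extreme_mesh(mesh):
            return True
    # Reached only after every mesh in the combination has been checked.
    return False

print(is_extreme_strategy([(2, 4), (1, 8)]))  # True: the second mesh is extreme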

# each stage only depends on its next stage
if scheme == '1F1B' or scheme == 'AFAB':
pipeline_cost = pp_last_stage_time
for stage_from_last in range(2, num_pp_stages):

high

The loop for stage_from_last in range(2, num_pp_stages): seems to have an off-by-one error. It will not iterate over all the necessary stages. For example, if num_pp_stages is 3, the loop only runs for stage_from_last = 2, missing the calculation for the first stage (index 0). To include all stages from the second-to-last down to the first, the range should be range(2, num_pp_stages + 1).

Suggested change
for stage_from_last in range(2, num_pp_stages):
for stage_from_last in range(2, num_pp_stages + 1):
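The off-by-one is easy to verify directly:

num_pp_stages = 3
print(list(range(2, num_pp_stages)))      # [2]    -> only one earlier stage is visited
print(list(range(2, num_pp_stages + 1)))  # [2, 3] -> the loop now reaches the first stage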

Comment on lines 93 to 96
os.environ["PYTHONPATH"] = (
"/workspace/20251010/new/FlagScale:"
"/workspace/20251010/new/FlagScale/third_party/Megatron-LM"
)

high

Hardcoding the PYTHONPATH makes the script brittle and not easily portable. These paths should be configurable, for example, by passing them as command-line arguments or reading them from a configuration file.
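As a hedged sketch of one way to make the paths configurable (the flag names are illustrative and not part of this PR):

import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument("--flagscale-root", default=os.getcwd(),
                    help="Path to the FlagScale checkout.")
parser.add_argument("--megatron-root", default=None,
                    help="Path to Megatron-LM; defaults to the copy bundled under third_party.")
args, _ = parser.parse_known_args()

megatron_root = args.megatron_root or os.path.join(
    args.flagscale_root, "third_party", "Megatron-LM")
os.environ["PYTHONPATH"] = f"{args.flagscale_root}:{megatron_root}"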

Comment on lines +335 to +349
config_file.write(f"{config_data}\n")

print(f"Hetero configurations saved to {output_config_file}")


import ast
import json


def read_configs_from_json(file_path: str):
configs_list = []
with open(file_path, "r") as file:
for line in file:
# config_data = json.loads(line.strip())
config_data = ast.literal_eval(line.strip())

medium

The script serializes configuration dictionaries to a file using str(dict) and deserializes them using ast.literal_eval. This is brittle and not a standard practice. It's better to use the json module (json.dumps to write and json.loads to read) for serializing and deserializing data. This is more robust and interoperable.
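A minimal JSON Lines sketch of this approach (the writer's name is illustrative; read_configs_from_json mirrors the function in the diff, but the exact signature is an assumption):

import json

def write_configs(configs, path):
    with open(path, "w") as f:
        for config in configs:
            f.write(json.dumps(config) + "\n")  # one JSON object per line

def read_configs_from_json(file_path: str):
    configs_list = []
    with open(file_path, "r") as f:
        for line in f:
            if line.strip():
                configs_list.append(json.loads(line))
    return configs_list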

pp_layers_split=hetero_config['pp_layer_split'],
)
print(f"pipeline cost: {pp_cost}")
break

medium

The for loop in the main execution block is terminated by a break statement after the first iteration. This means only one configuration is ever simulated. This is likely for debugging purposes, but it should be removed to allow for a full simulation run over all generated configurations.

Comment on lines +136 to +147
c10::intrusive_ptr<Work> BackendDummy::allreduce(
std::vector<at::Tensor>& tensors,
const AllreduceOptions& opts) {
// printf("dummy allreduce\n");
for (auto& tensor : tensors) {
tensor.zero_();
}

auto future = c10::make_intrusive<c10::ivalue::Future>(
c10::ListType::create(c10::TensorType::get()));
future->markCompleted(c10::IValue(tensors));
return c10::make_intrusive<WorkDummy>(OpType::ALLGATHER, std::move(future));

medium

The allreduce function incorrectly uses OpType::ALLGATHER when creating the WorkDummy object. It should use OpType::ALLREDUCE. This issue is present in many of the collective implementations in this file. Using the correct OpType is important for correctness and debugging, even in a dummy backend.

Suggested change
c10::intrusive_ptr<Work> BackendDummy::allreduce(
std::vector<at::Tensor>& tensors,
const AllreduceOptions& opts) {
// printf("dummy allreduce\n");
for (auto& tensor : tensors) {
tensor.zero_();
}
auto future = c10::make_intrusive<c10::ivalue::Future>(
c10::ListType::create(c10::TensorType::get()));
future->markCompleted(c10::IValue(tensors));
return c10::make_intrusive<WorkDummy>(OpType::ALLGATHER, std::move(future));
c10::intrusive_ptr<Work> BackendDummy::allreduce(
std::vector<at::Tensor>& tensors,
const AllreduceOptions& opts) {
// printf("dummy allreduce\n");
for (auto& tensor : tensors) {
tensor.zero_();
}
auto future = c10::make_intrusive<c10::ivalue::Future>(
c10::ListType::create(c10::TensorType::get()));
future->markCompleted(c10::IValue(tensors));
return c10::make_intrusive<WorkDummy>(OpType::ALLREDUCE, std::move(future));
}

Comment on lines +1402 to +1406
#use_gloo_process_groups=args.enable_gloo_process_groups,
#use_gloo_process_groups=False,
# If the user is asking for a non-zero embedding init std, skip weight decay for embeddings
# to avoid embeddings from shrinking to zero as recommended in https://arxiv.org/abs/2312.16903
default_skip_embedding_weight_decay=args.embedding_init_method_std is not None,
#default_skip_embedding_weight_decay=args.embedding_init_method_std is not None,

medium

There are several commented-out arguments in the get_megatron_optimizer call. This can make the code confusing and harder to maintain. If these arguments are not needed, they should be removed. If they are conditionally needed, their inclusion should be controlled by a flag.

Comment on lines 33 to 42
), "\flength of list {num_layers_per_stage} should match {num_stages}"
assert (
len(fwd_time_per_stage_chunk) == num_pp_stages
), "\flength of list {fwd_time_per_stage_chunk} should match {num_stages}"
assert (
len(bwd_time_per_stage_chunk) == num_pp_stages
), "\flength of list {bwd_time_per_stage_chunk} should match {num_stages}"
assert (
len(comm_time_between_stages) == num_pp_stages
), "\flength of list {comm_time_between_stages} should match {num_stages}"

medium

The assertion messages on lines 33, 36, 39, and 42 use \f which is a form-feed character. This is likely a typo and was intended to be an f-string for proper message formatting. Additionally, the variable names inside the string are incorrect. For example, on line 33, {num_layers_per_stage} should be {len(pp_layers_split)} and {num_stages} should be {num_pp_stages}.

Suggested change
), "\flength of list {num_layers_per_stage} should match {num_stages}"
assert (
len(fwd_time_per_stage_chunk) == num_pp_stages
), "\flength of list {fwd_time_per_stage_chunk} should match {num_stages}"
assert (
len(bwd_time_per_stage_chunk) == num_pp_stages
), "\flength of list {bwd_time_per_stage_chunk} should match {num_stages}"
assert (
len(comm_time_between_stages) == num_pp_stages
), "\flength of list {comm_time_between_stages} should match {num_stages}"
), f"length of list pp_layers_split {len(pp_layers_split)} should match num_pp_stages {num_pp_stages}"
assert (
len(fwd_time_per_stage_chunk) == num_pp_stages
), f"length of list fwd_time_per_stage_chunk {len(fwd_time_per_stage_chunk)} should match num_pp_stages {num_pp_stages}"
assert (
len(bwd_time_per_stage_chunk) == num_pp_stages
), f"length of list bwd_time_per_stage_chunk {len(bwd_time_per_stage_chunk)} should match num_pp_stages {num_pp_stages}"
assert (
len(comm_time_between_stages) == num_pp_stages
), f"length of list comm_time_between_stages {len(comm_time_between_stages)} should match num_pp_stages {num_pp_stages}"

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.
