
Conversation

@shuailong616 (Contributor)

Modify PR 389 to adapt the code for the new version of FlagScale
#389

@shuailong616 requested review from a team and aoyulong as code owners on October 29, 2025 08:43
@gemini-code-assist (Contributor)

Summary of Changes

Hello @shuailong616, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request delivers an initial implementation of a single-process simulator for the FlagScale framework. The simulator aims to facilitate the analysis of pipeline parallelism costs and memory consumption across various heterogeneous device configurations. By introducing a dummy distributed backend and adapting core training components, it allows for theoretical performance evaluation and optimal configuration generation in a non-distributed setting, which is crucial for efficient large-scale model training.

Highlights

  • Single-Process Simulator: Introduced a preliminary single-process simulator to analyze pipeline parallelism costs and memory usage for heterogeneous configurations without requiring a full distributed environment.
  • Dummy Distributed Backend: Added a custom C++ backend (dummy.hpp, dummy.cpp, setup.py) that provides mock implementations for PyTorch's c10d collective communication operations, enabling the simulator to run without actual distributed communication (see the usage sketch after this list).
  • Configuration Generation and Analysis: New Python scripts (analylize_pipeline_time.py, config_gen.py) were added to generate possible heterogeneous mesh configurations, split layers across pipeline stages, calculate theoretical peak memory usage, and simulate execution times to find optimal setups.
  • Integration with Core Training Logic: Modified parallel_context.py to bypass actual distributed all_gather_object calls and use the dummy backend when the simulator is enabled. train.py was updated to include timing for forward/backward passes and to run in forward_only mode during simulation, while theoretical_memory_usage.py and train_gpt.py were adjusted to handle simulator-specific arguments and disable NaN checks.
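For readers unfamiliar with custom c10d backends: the dummy backend described above follows PyTorch's pattern for process-group backends shipped as C++ extensions. A minimal, hypothetical usage sketch follows; the extension module name dummy_collectives is an assumption based on the file names in this PR and is not confirmed by the diff.

import os
import torch
import torch.distributed as dist

# Assumed: importing the compiled extension registers the "dummy" backend
# with torch.distributed (this is the pattern from PyTorch's custom-backend
# tutorial, not necessarily the exact layout used in this PR).
import dummy_collectives

os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="dummy", rank=0, world_size=1)

t = torch.ones(4)
dist.all_reduce(t)  # the dummy allreduce reviewed below zeroes the tensor instead of reducing
print(t)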
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review: /gemini review performs a code review for the current pull request in its current state.
  • Pull Request Summary: /gemini summary provides a summary of the current pull request in its current state.
  • Comment: @gemini-code-assist responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help: /gemini help displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the root of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in sharing feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check its output and use code with caution.

@gemini-code-assist (Contributor, bot) left a comment

Code Review

This pull request introduces a preliminary implementation of a single-process simulator for auto-tuning, which is a significant feature. The changes include a new dummy distributed backend, scripts for configuration generation and pipeline time analysis, and modifications to the training code to support simulation mode. My review has identified several issues, including two critical bugs that would either crash the program or break existing training functionality. I have also found a number of high-severity issues related to incorrect logic, security vulnerabilities, and maintainability problems like hardcoded paths and commands. I've provided detailed feedback and code suggestions to address these points. Overall, this is a good starting point, and addressing these comments will significantly improve the robustness and correctness of the simulator.

micro_batch_size=args.micro_batch_size,
decoder_seq_length=args.decoder_seq_length,
forward_only=False,
forward_only=True,

critical

Setting forward_only=True unconditionally will break the backward pass and prevent the model from training. This change is intended for the simulator to measure forward pass time. It must be guarded by a check for simulator mode, for example: forward_only=args.enable_simulator.

Suggested change
forward_only=True,
forward_only=args.enable_simulator,
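If the codebase does not already expose such a switch, here is a minimal sketch of how the flag could be registered; the flag name matches the suggestion above, but everything else is illustrative and not taken from FlagScale.

import argparse

def add_simulator_args(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
    group = parser.add_argument_group("simulator")
    # store_true defaults to False, so regular training keeps its backward pass.
    group.add_argument("--enable-simulator", dest="enable_simulator",
                       action="store_true",
                       help="Run in single-process simulation mode (forward only).")
    return parser

With that default, forward_only=args.enable_simulator is a no-op for normal training runs and only becomes True when the simulator is enabled.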

# os.environ["WORLD_SIZE"] = args.world_size
os.environ["WORLD_SIZE"] = "8"
# os.environ["WORLD_SIZE"] = "32"
rdav_endpoint = random.randint(0, 40000)

critical

There is a typo in the variable name rdav_endpoint. It should be rdzv_endpoint to match the environment variable RDZV_ENDPOINT set on the next line. This will cause a NameError.

Suggested change
rdav_endpoint = random.randint(0, 40000)
rdzv_endpoint = random.randint(0, 40000)

Comment on lines +118 to +119
else:
return False

high

The is_extreme_strategy function has a logic error. The else: return False statement is inside the loop, which will cause the function to return after checking only the first mesh. The return False should be moved outside the loop to ensure all meshes in the combination are checked.

            ):
                return True
        return False
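A compact sketch of the corrected control flow follows; the per-mesh predicate is a stand-in purely for illustration, since the real condition is not visible in this hunk.

def is_extreme_mesh(mesh):
    # Stand-in predicate for illustration only: call a mesh "extreme"
    # if any of its dimensions collapses to 1.
    return 1 in mesh

def is_extreme_strategy(meshes):
    for mesh in meshes:
        if is_extreme_mesh(mesh):
            return True
    # Reached only after every mesh in the combination has been checked.
    return False

print(is_extreme_strategy([(2, 4), (1, 8)]))  # True: the second mesh is extreme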

# each stage only depends on its next stage
if scheme == '1F1B' or scheme == 'AFAB':
pipeline_cost = pp_last_stage_time
for stage_from_last in range(2, num_pp_stages):

high

The loop for stage_from_last in range(2, num_pp_stages): seems to have an off-by-one error. It will not iterate over all the necessary stages. For example, if num_pp_stages is 3, the loop only runs for stage_from_last = 2, missing the calculation for the first stage (index 0). To include all stages from the second-to-last down to the first, the range should be range(2, num_pp_stages + 1).

Suggested change
for stage_from_last in range(2, num_pp_stages):
for stage_from_last in range(2, num_pp_stages + 1):
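The off-by-one is easy to verify directly:

num_pp_stages = 3
print(list(range(2, num_pp_stages)))      # [2]    -> only one earlier stage is visited
print(list(range(2, num_pp_stages + 1)))  # [2, 3] -> the loop now reaches the first stage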

Comment on lines 93 to 96
os.environ["PYTHONPATH"] = (
"/workspace/20251010/new/FlagScale:"
"/workspace/20251010/new/FlagScale/third_party/Megatron-LM"
)

high

Hardcoding the PYTHONPATH makes the script brittle and not easily portable. These paths should be configurable, for example, by passing them as command-line arguments or reading them from a configuration file.
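As a hedged sketch of one way to make the paths configurable (the flag names are illustrative and not part of this PR):

import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument("--flagscale-root", default=os.getcwd(),
                    help="Path to the FlagScale checkout.")
parser.add_argument("--megatron-root", default=None,
                    help="Path to Megatron-LM; defaults to the copy bundled under third_party.")
args, _ = parser.parse_known_args()

megatron_root = args.megatron_root or os.path.join(
    args.flagscale_root, "third_party", "Megatron-LM")
os.environ["PYTHONPATH"] = f"{args.flagscale_root}:{megatron_root}"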

Comment on lines +335 to +349
config_file.write(f"{config_data}\n")

print(f"Hetero configurations saved to {output_config_file}")


import ast
import json


def read_configs_from_json(file_path: str):
configs_list = []
with open(file_path, "r") as file:
for line in file:
# config_data = json.loads(line.strip())
config_data = ast.literal_eval(line.strip())

medium

The script serializes configuration dictionaries to a file using str(dict) and deserializes them using ast.literal_eval. This is brittle and not a standard practice. It's better to use the json module (json.dumps to write and json.loads to read) for serializing and deserializing data. This is more robust and interoperable.
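A minimal JSON Lines sketch of this approach (the writer's name is illustrative; read_configs_from_json mirrors the function in the diff, but the exact signature is an assumption):

import json

def write_configs(configs, path):
    with open(path, "w") as f:
        for config in configs:
            f.write(json.dumps(config) + "\n")  # one JSON object per line

def read_configs_from_json(file_path: str):
    configs_list = []
    with open(file_path, "r") as f:
        for line in f:
            if line.strip():
                configs_list.append(json.loads(line))
    return configs_list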

pp_layers_split=hetero_config['pp_layer_split'],
)
print(f"pipeline cost: {pp_cost}")
break

medium

The for loop in the main execution block is terminated by a break statement after the first iteration. This means only one configuration is ever simulated. This is likely for debugging purposes, but it should be removed to allow for a full simulation run over all generated configurations.

Comment on lines +136 to +147
c10::intrusive_ptr<Work> BackendDummy::allreduce(
std::vector<at::Tensor>& tensors,
const AllreduceOptions& opts) {
// printf("dummy allreduce\n");
for (auto& tensor : tensors) {
tensor.zero_();
}

auto future = c10::make_intrusive<c10::ivalue::Future>(
c10::ListType::create(c10::TensorType::get()));
future->markCompleted(c10::IValue(tensors));
return c10::make_intrusive<WorkDummy>(OpType::ALLGATHER, std::move(future));

medium

The allreduce function incorrectly uses OpType::ALLGATHER when creating the WorkDummy object. It should use OpType::ALLREDUCE. This issue is present in many of the collective implementations in this file. Using the correct OpType is important for correctness and debugging, even in a dummy backend.

Suggested change
c10::intrusive_ptr<Work> BackendDummy::allreduce(
std::vector<at::Tensor>& tensors,
const AllreduceOptions& opts) {
// printf("dummy allreduce\n");
for (auto& tensor : tensors) {
tensor.zero_();
}
auto future = c10::make_intrusive<c10::ivalue::Future>(
c10::ListType::create(c10::TensorType::get()));
future->markCompleted(c10::IValue(tensors));
return c10::make_intrusive<WorkDummy>(OpType::ALLGATHER, std::move(future));
c10::intrusive_ptr<Work> BackendDummy::allreduce(
std::vector<at::Tensor>& tensors,
const AllreduceOptions& opts) {
// printf("dummy allreduce\n");
for (auto& tensor : tensors) {
tensor.zero_();
}
auto future = c10::make_intrusive<c10::ivalue::Future>(
c10::ListType::create(c10::TensorType::get()));
future->markCompleted(c10::IValue(tensors));
return c10::make_intrusive<WorkDummy>(OpType::ALLREDUCE, std::move(future));
}

Comment on lines +1402 to +1406
#use_gloo_process_groups=args.enable_gloo_process_groups,
#use_gloo_process_groups=False,
# If the user is asking for a non-zero embedding init std, skip weight decay for embeddings
# to avoid embeddings from shrinking to zero as recommended in https://arxiv.org/abs/2312.16903
default_skip_embedding_weight_decay=args.embedding_init_method_std is not None,
#default_skip_embedding_weight_decay=args.embedding_init_method_std is not None,

medium

There are several commented-out arguments in the get_megatron_optimizer call. This can make the code confusing and harder to maintain. If these arguments are not needed, they should be removed. If they are conditionally needed, their inclusion should be controlled by a flag.

Comment on lines 33 to 42
), "\flength of list {num_layers_per_stage} should match {num_stages}"
assert (
len(fwd_time_per_stage_chunk) == num_pp_stages
), "\flength of list {fwd_time_per_stage_chunk} should match {num_stages}"
assert (
len(bwd_time_per_stage_chunk) == num_pp_stages
), "\flength of list {bwd_time_per_stage_chunk} should match {num_stages}"
assert (
len(comm_time_between_stages) == num_pp_stages
), "\flength of list {comm_time_between_stages} should match {num_stages}"

medium

The assertion messages on lines 33, 36, 39, and 42 use \f which is a form-feed character. This is likely a typo and was intended to be an f-string for proper message formatting. Additionally, the variable names inside the string are incorrect. For example, on line 33, {num_layers_per_stage} should be {len(pp_layers_split)} and {num_stages} should be {num_pp_stages}.

Suggested change
), "\flength of list {num_layers_per_stage} should match {num_stages}"
assert (
len(fwd_time_per_stage_chunk) == num_pp_stages
), "\flength of list {fwd_time_per_stage_chunk} should match {num_stages}"
assert (
len(bwd_time_per_stage_chunk) == num_pp_stages
), "\flength of list {bwd_time_per_stage_chunk} should match {num_stages}"
assert (
len(comm_time_between_stages) == num_pp_stages
), "\flength of list {comm_time_between_stages} should match {num_stages}"
), f"length of list pp_layers_split {len(pp_layers_split)} should match num_pp_stages {num_pp_stages}"
assert (
len(fwd_time_per_stage_chunk) == num_pp_stages
), f"length of list fwd_time_per_stage_chunk {len(fwd_time_per_stage_chunk)} should match num_pp_stages {num_pp_stages}"
assert (
len(bwd_time_per_stage_chunk) == num_pp_stages
), f"length of list bwd_time_per_stage_chunk {len(bwd_time_per_stage_chunk)} should match num_pp_stages {num_pp_stages}"
assert (
len(comm_time_between_stages) == num_pp_stages
), f"length of list comm_time_between_stages {len(comm_time_between_stages)} should match num_pp_stages {num_pp_stages}"

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.
