[BUG] Sequential execution fails with warm-clones - DPO #126

@anay-rfai

Bug Description

Sequential execution (num_chunks=1) fails when a warm clone is created at runtime. After the first configuration completes, the controller tries to process the warm clone configuration and the experiment crashes with ValueError: "Invalid chunk_id 1".

To Reproduce

Steps to reproduce the behavior:

  1. Start experiment with 2 initial configurations
  2. Set num_chunks=1 for sequential execution
  3. While first configuration is running, create a warm clone of it
  4. Wait for first configuration to complete
  5. Experiment fails when attempting to process the warm clone

Expected Behavior

Sequential execution should handle warm clones gracefully, allowing the cloned configuration to start after the parent configuration completes without throwing chunk validation errors.
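
One possible way to get this behavior, sketched under assumptions: the traceback shows the controller passing parent_run_details["num_chunks_visited_curr_epoch"] straight into get_clone_offset. If chunk ids are zero-based (an assumption, not confirmed in this report), a count equal to num_chunks has no matching id. The helper below, normalize_chunk_id, is a hypothetical name and not part of rapidfireai; it wraps the count back into the valid range. Whether wrapping to the first chunk is the right semantics for a warm clone is a design decision for the maintainers.

```python
def normalize_chunk_id(chunks_visited: int, num_chunks: int) -> int:
    """Map a per-epoch visited-chunk count onto a zero-based chunk id.

    Hypothetical helper, NOT the rapidfireai implementation. Assumes chunk
    ids run from 0 to num_chunks - 1, so a count equal to num_chunks
    (a completed epoch) wraps around to 0.
    """
    if num_chunks < 1:
        raise ValueError("num_chunks must be at least 1")
    if chunks_visited < 0:
        raise ValueError("chunks_visited must be non-negative")
    # Mid-epoch counts pass through unchanged; the epoch boundary wraps to 0.
    return chunks_visited % num_chunks


# Sequential mode: one chunk, parent finished its only chunk.
print(normalize_chunk_id(chunks_visited=1, num_chunks=1))
# Chunked mode, mid-epoch: the value is left as it was.
print(normalize_chunk_id(chunks_visited=2, num_chunks=4))
```

With num_chunks=1 the only count a finished parent can report is 1, which is why sequential mode would always hit the boundary case under this assumption.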

Environment

  • OS: Ubuntu
  • Python version: 3.12
  • RapidFire AI version: 0.12.6
  • Browser (if applicable): Chrome

Additional Context

  1. The issue occurs specifically with sequential execution (num_chunks=1).
  2. The error is raised in Controller._process_interactive_control while it calculates the clone's chunk offset.
  3. DatasetChunks.get_clone_offset checks last_completed_chunk against chunk_indices and rejects the value 1, which the controller passes from parent_run_details["num_chunks_visited_curr_epoch"].
  4. The problem appears to be in the chunk offset calculation logic for warm clones in sequential mode.
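
The failure can be illustrated without the library. The sketch below is a stand-in, not the rapidfireai code: DatasetChunksSketch and try_clone_offset are hypothetical names, the even split into chunks and the zero-based chunk ids are assumptions, and only the membership check and the "end index of the last completed chunk" behavior are taken from the traceback in Error Logs.

```python
class DatasetChunksSketch:
    """Hypothetical stand-in for the chunker (not the rapidfireai class)."""

    def __init__(self, dataset_len: int, num_chunks: int) -> None:
        if num_chunks < 1 or dataset_len < num_chunks:
            raise ValueError("need 1 <= num_chunks <= dataset_len")
        # Assumption: chunks are keyed by zero-based id and split evenly.
        base, extra = divmod(dataset_len, num_chunks)
        self.chunk_indices: dict[int, tuple[int, int]] = {}
        start = 0
        for chunk_id in range(num_chunks):
            end = start + base + (1 if chunk_id < extra else 0)
            self.chunk_indices[chunk_id] = (start, end)
            start = end

    def get_clone_offset(self, last_completed_chunk: int) -> int:
        # Same membership check as the one shown in the traceback.
        if last_completed_chunk not in self.chunk_indices:
            raise ValueError(f"Invalid chunk_id {last_completed_chunk}")
        # The end index of the last completed chunk is where the clone starts.
        return self.chunk_indices[last_completed_chunk][1]


def try_clone_offset(dataset_len: int, num_chunks: int, chunks_visited: int) -> str:
    """Return the clone offset as text, or the error message if the check fails."""
    chunker = DatasetChunksSketch(dataset_len, num_chunks)
    try:
        return str(chunker.get_clone_offset(chunks_visited))
    except ValueError as exc:
        return str(exc)


# Sequential mode: the parent has visited its single chunk, so the count is 1.
print(try_clone_offset(dataset_len=100, num_chunks=1, chunks_visited=1))
# Chunked mode, mid-epoch: the same count is a valid id.
print(try_clone_offset(dataset_len=100, num_chunks=4, chunks_visited=1))
```

Under these assumptions the single-chunk case has only chunk id 0, so a visited count of 1 can never pass the check, which matches the "Invalid chunk_id 1" message reported here.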

Error Logs

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File ~/miniconda3/envs/bench/lib/python3.12/site-packages/rapidfireai/fit/backend/controller.py:299, in Controller._process_interactive_control(self, run_states, clone_modify_tasks, len_train_dataset, seed, num_chunks)
    293 chunker = DatasetChunks(
    294     len_train_dataset,
    295     num_chunks,
    296     batch_size=effective_batch_size,
    297     offset=parent_run_details["chunk_offset"],
    298 )
--> 299 clone_chunk_offset = chunker.get_clone_offset(parent_run_details["num_chunks_visited_curr_epoch"])
    300 clone_modify_info = {
    301     "cloned_from": parent_run_id,
    302     "warm_started_from": parent_run_id,
    303     "chunk_offset": clone_chunk_offset,
    304 }

File ~/miniconda3/envs/bench/lib/python3.12/site-packages/rapidfireai/fit/backend/chunks.py:116, in DatasetChunks.get_clone_offset(self, last_completed_chunk)
    115 if last_completed_chunk not in self.chunk_indices:
--> 116     raise ValueError(f"Invalid chunk_id {last_completed_chunk}")
    118 # Get the end index of the last completed chunk
    119 # This is where the next run should start

ValueError: Invalid chunk_id 1

The above exception was the direct cause of the following exception:

ControllerException                       Traceback (most recent call last)
File ~/miniconda3/envs/bench/lib/python3.12/site-packages/rapidfireai/fit/backend/controller.py:620, in Controller.run_fit(self, param_config, create_model_fn, train_dataset, eval_dataset, num_chunks, seed)
    619     run_states, clone_modify_tasks = self._process_interm_ic_ops_states(currently_scheduled_runs)
--> 620     self._process_interactive_control(
    621         run_states,
    622         clone_modify_tasks,
    623         len_train_dataset,
    624         seed,
    625         num_chunks,
    626     )
    627 except Exception as e:

File ~/miniconda3/envs/bench/lib/python3.12/site-packages/rapidfireai/fit/backend/controller.py:324, in Controller._process_interactive_control(self, run_states, clone_modify_tasks, len_train_dataset, seed, num_chunks)
    323 self.ic_logger.error(f"Error creating model for run {parent_run_id}: {e}")
--> 324 raise ControllerException(f"Error creating model for run {parent_run_id}: {e}") from e

ControllerException: Error creating model for run 1: Invalid chunk_id 1

The above exception was the direct cause of the following exception:

ControllerException                       Traceback (most recent call last)
File ~/miniconda3/envs/bench/lib/python3.12/site-packages/rapidfireai/fit/backend/controller.py:628, in Controller.run_fit(self, param_config, create_model_fn, train_dataset, eval_dataset, num_chunks, seed)
    627 except Exception as e:
--> 628     raise ControllerException(f"Error processing interactive control tasks: {e}") from e
    630 # fetch latest run states again (post IC ops states)

ControllerException: Error processing interactive control tasks: Error creating model for run 1: Invalid chunk_id 1

The above exception was the direct cause of the following exception:

ControllerException                       Traceback (most recent call last)
File ~/miniconda3/envs/bench/lib/python3.12/site-packages/rapidfireai/experiment.py:273, in Experiment.run_fit(self, param_config, create_model_fn, train_dataset, eval_dataset, num_chunks, seed)
    272     controller = Controller(self.experiment_id, self.experiment_name)
--> 273     controller.run_fit(param_config, create_model_fn, train_dataset, eval_dataset, num_chunks, seed)
    274 except Exception as e:

File ~/miniconda3/envs/bench/lib/python3.12/site-packages/rapidfireai/fit/backend/controller.py:696, in Controller.run_fit(self, param_config, create_model_fn, train_dataset, eval_dataset, num_chunks, seed)
    695 except Exception as e:
--> 696     raise ControllerException(f"Error during run_fit: {e}") from e
    698 # shutdown workers

ControllerException: Error during run_fit: Error processing interactive control tasks: Error creating model for run 1: Invalid chunk_id 1

The above exception was the direct cause of the following exception:

ExperimentException                       Traceback (most recent call last)
Cell In[8], line 2
      1 # Launch training of all configs in the config_group with swap granularity of 4 chunks
----> 2 experiment.run_fit(config_group, sample_create_model, train_dataset, eval_dataset=None, num_chunks=1, seed=42)

File ~/miniconda3/envs/bench/lib/python3.12/site-packages/rapidfireai/experiment.py:277, in Experiment.run_fit(self, param_config, create_model_fn, train_dataset, eval_dataset, num_chunks, seed)
    275 if hasattr(self, "logger"):
    276     self.logger.opt(exception=True).error(f"Error running fit: {e}")
--> 277 raise ExperimentException(f"Error running fit: {e}, traceback: {traceback.format_exc()}") from e

ExperimentException: Error running fit: Error during run_fit: Error processing interactive control tasks: Error creating model for run 1: Invalid chunk_id 1, traceback: Traceback (most recent call last):
  File "/home/palebluedot/miniconda3/envs/bench/lib/python3.12/site-packages/rapidfireai/fit/backend/controller.py", line 299, in _process_interactive_control
    clone_chunk_offset = chunker.get_clone_offset(parent_run_details["num_chunks_visited_curr_epoch"])
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/palebluedot/miniconda3/envs/bench/lib/python3.12/site-packages/rapidfireai/fit/backend/chunks.py", line 116, in get_clone_offset
    raise ValueError(f"Invalid chunk_id {last_completed_chunk}")
ValueError: Invalid chunk_id 1

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/palebluedot/miniconda3/envs/bench/lib/python3.12/site-packages/rapidfireai/fit/backend/controller.py", line 620, in run_fit
    self._process_interactive_control(
  File "/home/palebluedot/miniconda3/envs/bench/lib/python3.12/site-packages/rapidfireai/fit/backend/controller.py", line 324, in _process_interactive_control
    raise ControllerException(f"Error creating model for run {parent_run_id}: {e}") from e
rapidfireai.fit.utils.exceptions.ControllerException: Error creating model for run 1: Invalid chunk_id 1

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/palebluedot/miniconda3/envs/bench/lib/python3.12/site-packages/rapidfireai/fit/backend/controller.py", line 628, in run_fit
    raise ControllerException(f"Error processing interactive control tasks: {e}") from e
rapidfireai.fit.utils.exceptions.ControllerException: Error processing interactive control tasks: Error creating model for run 1: Invalid chunk_id 1

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/palebluedot/miniconda3/envs/bench/lib/python3.12/site-packages/rapidfireai/experiment.py", line 273, in run_fit
    controller.run_fit(param_config, create_model_fn, train_dataset, eval_dataset, num_chunks, seed)
  File "/home/palebluedot/miniconda3/envs/bench/lib/python3.12/site-packages/rapidfireai/fit/backend/controller.py", line 696, in run_fit
    raise ControllerException(f"Error during run_fit: {e}") from e
rapidfireai.fit.utils.exceptions.ControllerException: Error during run_fit: Error processing interactive control tasks: Error creating model for run 1: Invalid chunk_id 1

Labels

bug
