Open
Labels
bug: Something isn't working
Description
Bug Description
Sequential execution fails when a warm clone is added at runtime. The experiment crashes with an "Invalid chunk_id 1" error after the first configuration completes and the controller attempts to process the warm clone configuration.
To Reproduce
Steps to reproduce the behavior:
- Start experiment with 2 initial configurations
- Set num_chunks=1 for sequential execution
- While first configuration is running, create a warm clone of it
- Wait for first configuration to complete
- Experiment fails when attempting to process the warm clone
Expected Behavior
Sequential execution should handle warm clones gracefully, allowing the cloned configuration to start after the parent configuration completes without throwing chunk validation errors.
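As a rough illustration of the expected behavior, the sketch below maps the parent's visited-chunk count onto a valid chunk index by wrapping around at the end of an epoch. The helper `clone_start_chunk` is hypothetical and is not part of RapidFire AI; it also assumes chunk ids are zero-based, which the issue does not confirm.

```python
def clone_start_chunk(num_chunks_visited: int, num_chunks: int) -> int:
    """Hypothetical helper: zero-based chunk where a warm clone could start.

    Assumes chunk ids run from 0 to num_chunks - 1 and that a parent which
    has visited every chunk in the epoch hands the clone back to chunk 0.
    """
    if num_chunks < 1:
        raise ValueError("num_chunks must be at least 1")
    if num_chunks_visited < 0:
        raise ValueError("num_chunks_visited must be non-negative")
    # Wrap around so a completed epoch never yields an out-of-range id.
    return num_chunks_visited % num_chunks


# Sequential mode: one chunk, parent has visited it once.
sequential_start = clone_start_chunk(num_chunks_visited=1, num_chunks=1)
# Chunked mode: four chunks, parent has visited two of them.
chunked_start = clone_start_chunk(num_chunks_visited=2, num_chunks=4)
print(sequential_start, chunked_start)
```

Whether wrapping to the first chunk is the right semantics for a warm clone is a design question for the maintainers; the point here is only that the sequential case should resolve to a valid chunk rather than raise.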
Environment
- OS: Ubuntu
- Python version: 3.12
- RapidFire AI version: 0.12.6
- Browser (if applicable): Chrome
Additional Context
- Issue occurs specifically with sequential execution (num_chunks=1)
- Error happens during interactive control processing, when the controller attempts to calculate the clone chunk offset
- The chunker validates last_completed_chunk against its chunk indices but receives an invalid chunk_id value of 1
- Problem appears to be in the chunk offset calculation logic for warm clones in sequential mode
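The suspected failure mode can be reproduced in isolation with a toy chunker. This is a minimal sketch, not the real `DatasetChunks` class: `ToyChunks`, its constructor, and its offset arithmetic are assumptions made for illustration. Only the membership check and the error message mirror the lines shown in the traceback. It assumes chunk ids are zero-based, so with num_chunks=1 the only valid id is 0, while the parent's num_chunks_visited_curr_epoch is already 1 once its single chunk completes.

```python
class ToyChunks:
    """Toy stand-in for the chunker (assumed behavior, not the library's code)."""

    def __init__(self, dataset_len: int, num_chunks: int, offset: int = 0):
        self.dataset_len = dataset_len
        self.offset = offset
        # Map zero-based chunk id -> (start, end) index range.
        base, extra = divmod(dataset_len, num_chunks)
        self.chunk_indices = {}
        start = 0
        for chunk_id in range(num_chunks):
            size = base + (1 if chunk_id < extra else 0)
            self.chunk_indices[chunk_id] = (start, start + size)
            start += size

    def get_clone_offset(self, last_completed_chunk: int) -> int:
        # Same validation as the line shown in the traceback.
        if last_completed_chunk not in self.chunk_indices:
            raise ValueError(f"Invalid chunk_id {last_completed_chunk}")
        # The clone starts where the last completed chunk ended.
        _, end = self.chunk_indices[last_completed_chunk]
        return (self.offset + end) % self.dataset_len


# Sequential execution: a single chunk, so the only valid id is 0.
chunker = ToyChunks(dataset_len=100, num_chunks=1)
num_chunks_visited_curr_epoch = 1  # the parent has finished its only chunk

error_message = None
try:
    chunker.get_clone_offset(num_chunks_visited_curr_epoch)
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```

If this model matches the real chunker, the count of visited chunks is being used as a chunk id, which would go out of range whenever the parent has visited every chunk in the epoch and not only in sequential mode.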
Error Logs
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
File ~/miniconda3/envs/bench/lib/python3.12/site-packages/rapidfireai/fit/backend/controller.py:299, in Controller._process_interactive_control(self, run_states, clone_modify_tasks, len_train_dataset, seed, num_chunks)
293 chunker = DatasetChunks(
294 len_train_dataset,
295 num_chunks,
296 batch_size=effective_batch_size,
297 offset=parent_run_details["chunk_offset"],
298 )
--> 299 clone_chunk_offset = chunker.get_clone_offset(parent_run_details["num_chunks_visited_curr_epoch"])
300 clone_modify_info = {
301 "cloned_from": parent_run_id,
302 "warm_started_from": parent_run_id,
303 "chunk_offset": clone_chunk_offset,
304 }
File ~/miniconda3/envs/bench/lib/python3.12/site-packages/rapidfireai/fit/backend/chunks.py:116, in DatasetChunks.get_clone_offset(self, last_completed_chunk)
115 if last_completed_chunk not in self.chunk_indices:
--> 116 raise ValueError(f"Invalid chunk_id {last_completed_chunk}")
118 # Get the end index of the last completed chunk
119 # This is where the next run should start
ValueError: Invalid chunk_id 1
The above exception was the direct cause of the following exception:
ControllerException Traceback (most recent call last)
File ~/miniconda3/envs/bench/lib/python3.12/site-packages/rapidfireai/fit/backend/controller.py:620, in Controller.run_fit(self, param_config, create_model_fn, train_dataset, eval_dataset, num_chunks, seed)
619 run_states, clone_modify_tasks = self._process_interm_ic_ops_states(currently_scheduled_runs)
--> 620 self._process_interactive_control(
621 run_states,
622 clone_modify_tasks,
623 len_train_dataset,
624 seed,
625 num_chunks,
626 )
627 except Exception as e:
File ~/miniconda3/envs/bench/lib/python3.12/site-packages/rapidfireai/fit/backend/controller.py:324, in Controller._process_interactive_control(self, run_states, clone_modify_tasks, len_train_dataset, seed, num_chunks)
323 self.ic_logger.error(f"Error creating model for run {parent_run_id}: {e}")
--> 324 raise ControllerException(f"Error creating model for run {parent_run_id}: {e}") from e
ControllerException: Error creating model for run 1: Invalid chunk_id 1
The above exception was the direct cause of the following exception:
ControllerException Traceback (most recent call last)
File ~/miniconda3/envs/bench/lib/python3.12/site-packages/rapidfireai/fit/backend/controller.py:628, in Controller.run_fit(self, param_config, create_model_fn, train_dataset, eval_dataset, num_chunks, seed)
627 except Exception as e:
--> 628 raise ControllerException(f"Error processing interactive control tasks: {e}") from e
630 # fetch latest run states again (post IC ops states)
ControllerException: Error processing interactive control tasks: Error creating model for run 1: Invalid chunk_id 1
The above exception was the direct cause of the following exception:
ControllerException Traceback (most recent call last)
File ~/miniconda3/envs/bench/lib/python3.12/site-packages/rapidfireai/experiment.py:273, in Experiment.run_fit(self, param_config, create_model_fn, train_dataset, eval_dataset, num_chunks, seed)
272 controller = Controller(self.experiment_id, self.experiment_name)
--> 273 controller.run_fit(param_config, create_model_fn, train_dataset, eval_dataset, num_chunks, seed)
274 except Exception as e:
File ~/miniconda3/envs/bench/lib/python3.12/site-packages/rapidfireai/fit/backend/controller.py:696, in Controller.run_fit(self, param_config, create_model_fn, train_dataset, eval_dataset, num_chunks, seed)
695 except Exception as e:
--> 696 raise ControllerException(f"Error during run_fit: {e}") from e
698 # shutdown workers
ControllerException: Error during run_fit: Error processing interactive control tasks: Error creating model for run 1: Invalid chunk_id 1
The above exception was the direct cause of the following exception:
ExperimentException Traceback (most recent call last)
Cell In[8], line 2
1 # Launch training of all configs in the config_group with swap granularity of 4 chunks
----> 2 experiment.run_fit(config_group, sample_create_model, train_dataset, eval_dataset=None, num_chunks=1, seed=42)
File ~/miniconda3/envs/bench/lib/python3.12/site-packages/rapidfireai/experiment.py:277, in Experiment.run_fit(self, param_config, create_model_fn, train_dataset, eval_dataset, num_chunks, seed)
275 if hasattr(self, "logger"):
276 self.logger.opt(exception=True).error(f"Error running fit: {e}")
--> 277 raise ExperimentException(f"Error running fit: {e}, traceback: {traceback.format_exc()}") from e
ExperimentException: Error running fit: Error during run_fit: Error processing interactive control tasks: Error creating model for run 1: Invalid chunk_id 1, traceback: Traceback (most recent call last):
File "/home/palebluedot/miniconda3/envs/bench/lib/python3.12/site-packages/rapidfireai/fit/backend/controller.py", line 299, in _process_interactive_control
clone_chunk_offset = chunker.get_clone_offset(parent_run_details["num_chunks_visited_curr_epoch"])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/palebluedot/miniconda3/envs/bench/lib/python3.12/site-packages/rapidfireai/fit/backend/chunks.py", line 116, in get_clone_offset
raise ValueError(f"Invalid chunk_id {last_completed_chunk}")
ValueError: Invalid chunk_id 1
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/palebluedot/miniconda3/envs/bench/lib/python3.12/site-packages/rapidfireai/fit/backend/controller.py", line 620, in run_fit
self._process_interactive_control(
File "/home/palebluedot/miniconda3/envs/bench/lib/python3.12/site-packages/rapidfireai/fit/backend/controller.py", line 324, in _process_interactive_control
raise ControllerException(f"Error creating model for run {parent_run_id}: {e}") from e
rapidfireai.fit.utils.exceptions.ControllerException: Error creating model for run 1: Invalid chunk_id 1
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/palebluedot/miniconda3/envs/bench/lib/python3.12/site-packages/rapidfireai/fit/backend/controller.py", line 628, in run_fit
raise ControllerException(f"Error processing interactive control tasks: {e}") from e
rapidfireai.fit.utils.exceptions.ControllerException: Error processing interactive control tasks: Error creating model for run 1: Invalid chunk_id 1
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/palebluedot/miniconda3/envs/bench/lib/python3.12/site-packages/rapidfireai/experiment.py", line 273, in run_fit
controller.run_fit(param_config, create_model_fn, train_dataset, eval_dataset, num_chunks, seed)
File "/home/palebluedot/miniconda3/envs/bench/lib/python3.12/site-packages/rapidfireai/fit/backend/controller.py", line 696, in run_fit
raise ControllerException(f"Error during run_fit: {e}") from e
rapidfireai.fit.utils.exceptions.ControllerException: Error during run_fit: Error processing interactive control tasks: Error creating model for run 1: Invalid chunk_id 1