[BUG] Sequential execution fails with warm-clones - DPO

## Bug Description
Sequential execution fails when a warm clone is added at runtime. After the first configuration completes, the experiment crashes with an "Invalid chunk_id 1" error while attempting to process the warm clone configuration.

## To Reproduce
Steps to reproduce the behavior:

1. Start an experiment with 2 initial configurations
2. Set `num_chunks=1` for sequential execution
3. While the first configuration is running, create a warm clone of it
4. Wait for the first configuration to complete
5. The experiment fails when it attempts to process the warm clone

## Expected Behavior
Sequential execution should handle warm clones gracefully, allowing the cloned configuration to start after the parent configuration completes without throwing chunk validation errors.

## Screenshots

<img width="1697" height="755" alt="Image" src="https://github.com/user-attachments/assets/64bec9b2-6d72-49e3-915c-90e8bbcefb24" />

## Environment
- OS: Ubuntu
- Python version: 3.12
- RapidFire AI version: 0.12.6
- Browser (if applicable): Chrome

## Additional Context

1. The issue occurs specifically with sequential execution (`num_chunks=1`)
2. The error is raised in `Controller._process_interactive_control` (interactive control processing) while computing the clone's chunk offset
3. `DatasetChunks.get_clone_offset` checks `last_completed_chunk` against `self.chunk_indices` and rejects the value 1, which the controller passes in from `parent_run_details["num_chunks_visited_curr_epoch"]`
4. The problem appears to be in the chunk offset calculation logic for warm clones in sequential mode
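
To make the suspected failure mode concrete, here is a minimal sketch. `ToyDatasetChunks` is a hypothetical stand-in, not the real `DatasetChunks`: it assumes chunk ids are 0-based keys `0..num_chunks-1` and copies only the membership check visible in the traceback. The offset arithmetic is illustrative and may not match the library. Under that assumption, a parent that has finished its only chunk reports a visited count of 1, which is not a valid chunk id when `num_chunks=1`.

```python
class ToyDatasetChunks:
    """Hypothetical stand-in for DatasetChunks (assumption: 0-based chunk ids)."""

    def __init__(self, dataset_size: int, num_chunks: int, offset: int = 0):
        self.dataset_size = dataset_size
        self.offset = offset
        chunk_size = dataset_size // num_chunks
        # Map chunk id -> (start, end); the last chunk absorbs any remainder.
        self.chunk_indices = {}
        for i in range(num_chunks):
            start = i * chunk_size
            end = dataset_size if i == num_chunks - 1 else start + chunk_size
            self.chunk_indices[i] = (start, end)

    def get_clone_offset(self, last_completed_chunk: int) -> int:
        # Same validation as the check shown in the traceback.
        if last_completed_chunk not in self.chunk_indices:
            raise ValueError(f"Invalid chunk_id {last_completed_chunk}")
        # Illustrative only: start the clone where the given chunk ends.
        _, end = self.chunk_indices[last_completed_chunk]
        return (self.offset + end) % self.dataset_size


def try_clone_offset(num_chunks: int, num_chunks_visited: int) -> str:
    """Pass a visited-chunk count as a chunk id, as the controller call does."""
    chunker = ToyDatasetChunks(dataset_size=100, num_chunks=num_chunks)
    try:
        return f"offset={chunker.get_clone_offset(num_chunks_visited)}"
    except ValueError as exc:
        return f"error: {exc}"


if __name__ == "__main__":
    # Sequential mode: the only valid id is 0, but the visited count is 1.
    print(try_clone_offset(num_chunks=1, num_chunks_visited=1))
    # With more chunks, a count of 1 happens to be a valid id.
    print(try_clone_offset(num_chunks=4, num_chunks_visited=1))
```

If this reading is right, the same mismatch could also appear with `num_chunks > 1` whenever the visited count equals `num_chunks`, but that has not been verified here.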

## Error Logs

```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File ~/miniconda3/envs/bench/lib/python3.12/site-packages/rapidfireai/fit/backend/controller.py:299, in Controller._process_interactive_control(self, run_states, clone_modify_tasks, len_train_dataset, seed, num_chunks)
    293 chunker = DatasetChunks(
    294     len_train_dataset,
    295     num_chunks,
    296     batch_size=effective_batch_size,
    297     offset=parent_run_details["chunk_offset"],
    298 )
--> 299 clone_chunk_offset = chunker.get_clone_offset(parent_run_details["num_chunks_visited_curr_epoch"])
    300 clone_modify_info = {
    301     "cloned_from": parent_run_id,
    302     "warm_started_from": parent_run_id,
    303     "chunk_offset": clone_chunk_offset,
    304 }

File ~/miniconda3/envs/bench/lib/python3.12/site-packages/rapidfireai/fit/backend/chunks.py:116, in DatasetChunks.get_clone_offset(self, last_completed_chunk)
    115 if last_completed_chunk not in self.chunk_indices:
--> 116     raise ValueError(f"Invalid chunk_id {last_completed_chunk}")
    118 # Get the end index of the last completed chunk
    119 # This is where the next run should start

ValueError: Invalid chunk_id 1

The above exception was the direct cause of the following exception:

ControllerException                       Traceback (most recent call last)
File ~/miniconda3/envs/bench/lib/python3.12/site-packages/rapidfireai/fit/backend/controller.py:620, in Controller.run_fit(self, param_config, create_model_fn, train_dataset, eval_dataset, num_chunks, seed)
    619     run_states, clone_modify_tasks = self._process_interm_ic_ops_states(currently_scheduled_runs)
--> 620     self._process_interactive_control(
    621         run_states,
    622         clone_modify_tasks,
    623         len_train_dataset,
    624         seed,
    625         num_chunks,
    626     )
    627 except Exception as e:

File ~/miniconda3/envs/bench/lib/python3.12/site-packages/rapidfireai/fit/backend/controller.py:324, in Controller._process_interactive_control(self, run_states, clone_modify_tasks, len_train_dataset, seed, num_chunks)
    323 self.ic_logger.error(f"Error creating model for run {parent_run_id}: {e}")
--> 324 raise ControllerException(f"Error creating model for run {parent_run_id}: {e}") from e

ControllerException: Error creating model for run 1: Invalid chunk_id 1

The above exception was the direct cause of the following exception:

ControllerException                       Traceback (most recent call last)
File ~/miniconda3/envs/bench/lib/python3.12/site-packages/rapidfireai/fit/backend/controller.py:628, in Controller.run_fit(self, param_config, create_model_fn, train_dataset, eval_dataset, num_chunks, seed)
    627 except Exception as e:
--> 628     raise ControllerException(f"Error processing interactive control tasks: {e}") from e
    630 # fetch latest run states again (post IC ops states)

ControllerException: Error processing interactive control tasks: Error creating model for run 1: Invalid chunk_id 1

The above exception was the direct cause of the following exception:

ControllerException                       Traceback (most recent call last)
File ~/miniconda3/envs/bench/lib/python3.12/site-packages/rapidfireai/experiment.py:273, in Experiment.run_fit(self, param_config, create_model_fn, train_dataset, eval_dataset, num_chunks, seed)
    272     controller = Controller(self.experiment_id, self.experiment_name)
--> 273     controller.run_fit(param_config, create_model_fn, train_dataset, eval_dataset, num_chunks, seed)
    274 except Exception as e:

File ~/miniconda3/envs/bench/lib/python3.12/site-packages/rapidfireai/fit/backend/controller.py:696, in Controller.run_fit(self, param_config, create_model_fn, train_dataset, eval_dataset, num_chunks, seed)
    695 except Exception as e:
--> 696     raise ControllerException(f"Error during run_fit: {e}") from e
    698 # shutdown workers

ControllerException: Error during run_fit: Error processing interactive control tasks: Error creating model for run 1: Invalid chunk_id 1

The above exception was the direct cause of the following exception:

ExperimentException                       Traceback (most recent call last)
Cell In[8], line 2
      1 # Launch training of all configs in the config_group with swap granularity of 4 chunks
----> 2 experiment.run_fit(config_group, sample_create_model, train_dataset, eval_dataset=None, num_chunks=1, seed=42)

File ~/miniconda3/envs/bench/lib/python3.12/site-packages/rapidfireai/experiment.py:277, in Experiment.run_fit(self, param_config, create_model_fn, train_dataset, eval_dataset, num_chunks, seed)
    275 if hasattr(self, "logger"):
    276     self.logger.opt(exception=True).error(f"Error running fit: {e}")
--> 277 raise ExperimentException(f"Error running fit: {e}, traceback: {traceback.format_exc()}") from e

ExperimentException: Error running fit: Error during run_fit: Error processing interactive control tasks: Error creating model for run 1: Invalid chunk_id 1, traceback: Traceback (most recent call last):
  File "/home/palebluedot/miniconda3/envs/bench/lib/python3.12/site-packages/rapidfireai/fit/backend/controller.py", line 299, in _process_interactive_control
    clone_chunk_offset = chunker.get_clone_offset(parent_run_details["num_chunks_visited_curr_epoch"])
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/palebluedot/miniconda3/envs/bench/lib/python3.12/site-packages/rapidfireai/fit/backend/chunks.py", line 116, in get_clone_offset
    raise ValueError(f"Invalid chunk_id {last_completed_chunk}")
ValueError: Invalid chunk_id 1

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/palebluedot/miniconda3/envs/bench/lib/python3.12/site-packages/rapidfireai/fit/backend/controller.py", line 620, in run_fit
    self._process_interactive_control(
  File "/home/palebluedot/miniconda3/envs/bench/lib/python3.12/site-packages/rapidfireai/fit/backend/controller.py", line 324, in _process_interactive_control
    raise ControllerException(f"Error creating model for run {parent_run_id}: {e}") from e
rapidfireai.fit.utils.exceptions.ControllerException: Error creating model for run 1: Invalid chunk_id 1

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/palebluedot/miniconda3/envs/bench/lib/python3.12/site-packages/rapidfireai/fit/backend/controller.py", line 628, in run_fit
    raise ControllerException(f"Error processing interactive control tasks: {e}") from e
rapidfireai.fit.utils.exceptions.ControllerException: Error processing interactive control tasks: Error creating model for run 1: Invalid chunk_id 1

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/palebluedot/miniconda3/envs/bench/lib/python3.12/site-packages/rapidfireai/experiment.py", line 273, in run_fit
    controller.run_fit(param_config, create_model_fn, train_dataset, eval_dataset, num_chunks, seed)
  File "/home/palebluedot/miniconda3/envs/bench/lib/python3.12/site-packages/rapidfireai/fit/backend/controller.py", line 696, in run_fit
    raise ControllerException(f"Error during run_fit: {e}") from e
rapidfireai.fit.utils.exceptions.ControllerException: Error during run_fit: Error processing interactive control tasks: Error creating model for run 1: Invalid chunk_id 1
```

