Conversation

Contributor

@nick-rfai nick-rfai commented Oct 6, 2025

Changes

  • Add a new Colab action to the start script (rapidfireai start --colab)
  • Integrate TensorBoard through a new metric_logger abstract base class (ABC)
  • Create a new interactive controller (IC) Python widget for interacting with the dispatcher in Colab
  • Allow run_fit to run in the background so that users can use the Python widget during training
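
As a rough illustration of the metric_logger change listed above, the sketch below shows one way an abstract logging interface could sit in front of TensorBoard and be selected through the RF_TRACKING_BACKEND variable used in the test steps. Every class and function name here (MetricLogger, InMemoryMetricLogger, TensorBoardMetricLogger, make_metric_logger), the "memory" fallback backend, and the default log directory are assumptions made for illustration; the actual ABC in this PR may look different.

```python
import os
from abc import ABC, abstractmethod


class MetricLogger(ABC):
    """Hypothetical interface; the PR's real metric_logger ABC may differ."""

    @abstractmethod
    def log_metric(self, run_id: str, name: str, value: float, step: int) -> None:
        """Record one scalar metric for one run at one training step."""


class InMemoryMetricLogger(MetricLogger):
    """Illustrative fallback backend that keeps metrics in a list."""

    def __init__(self) -> None:
        self.records = []

    def log_metric(self, run_id: str, name: str, value: float, step: int) -> None:
        self.records.append((run_id, name, float(value), step))


class TensorBoardMetricLogger(MetricLogger):
    """Writes each run to its own subdirectory so TensorBoard can compare runs."""

    def __init__(self, log_dir: str) -> None:
        # Imported lazily so the module still loads without TensorBoard installed.
        from torch.utils.tensorboard import SummaryWriter

        self._writer_cls = SummaryWriter
        self._log_dir = log_dir
        self._writers = {}

    def log_metric(self, run_id: str, name: str, value: float, step: int) -> None:
        writer = self._writers.get(run_id)
        if writer is None:
            writer = self._writer_cls(log_dir=os.path.join(self._log_dir, run_id))
            self._writers[run_id] = writer
        writer.add_scalar(name, value, global_step=step)


def make_metric_logger(log_dir: str = "./tensorboard_logs") -> MetricLogger:
    """Pick a backend from RF_TRACKING_BACKEND (the 'memory' default is an assumption)."""
    backend = os.environ.get("RF_TRACKING_BACKEND", "memory").lower()
    if backend == "tensorboard":
        return TensorBoardMetricLogger(log_dir)
    if backend == "memory":
        return InMemoryMetricLogger()
    raise ValueError(f"Unknown tracking backend: {backend!r}")
```

Keeping callers on the abstract interface means training code only ever calls log_metric, so switching backends is a matter of changing the environment variable rather than the call sites.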

Testing

  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have tested this change manually
  • I have tested this change in the following environments:
    • Local development
    • Colab
    • PBD
    • Regression tests pass on the lite notebooks

Screenshots + Test Steps

  1. For the Colab notebook, you need a T4 GPU, and you may need to purchase additional credits if you run multiple experiments. Also, the notebook shuts down automatically after 5-10 minutes of inactivity.
  2. Run pip install git+https://github.com/RapidFireAI/rapidfireai.git@feat/tensorboard-integration and ignore any pip warnings.
  3. Run !rapidfireai init in a cell to download the tutorial notebooks, then bring over the cells from the colab-tensorboard notebook.
  4. Open the Colab terminal, export the TensorBoard tracking-backend variable, and run the start script:
export RF_TRACKING_BACKEND=tensorboard
rapidfireai start --colab
Screenshot 2025-10-10 at 9 46 12 PM
  5. Run the following in a Colab cell:
import os
os.environ['RF_TRACKING_BACKEND'] = 'tensorboard'

%reload_ext tensorboard
  6. Run the GPT2 experiment.
  7. Make sure the TensorBoard widget shows up.
Screenshot 2025-10-10 at 9 45 51 PM
  8. Make sure the IC widget shows up (there should be no data here until run_fit is active).
  9. Run the run_fit cell. Note that on a Colab free-tier account, an automatic timeout will shut down the notebook.
  10. Perform a clone-modify or another command in the IC op widget.
Screenshot 2025-10-10 at 9 46 29 PM
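
The final steps above depend on run_fit running in the background while the notebook stays responsive to the IC widget. The sketch below illustrates that general pattern with a worker thread and a command queue. It is not the PR's implementation: the run_fit stand-in, run_fit_in_background, the queue-based command channel, and the applied list are all hypothetical names chosen for the example.

```python
import queue
import threading
import time


def run_fit(commands: queue.Queue, applied: list, steps: int = 5) -> None:
    """Stand-in training loop that polls for controller commands between steps."""
    for _ in range(steps):
        time.sleep(0.01)  # placeholder for one chunk of training work
        # Drain any commands the widget queued while this chunk was running.
        while True:
            try:
                applied.append(commands.get_nowait())
            except queue.Empty:
                break


def run_fit_in_background(commands: queue.Queue, applied: list, steps: int = 5) -> threading.Thread:
    """Start run_fit on a daemon thread so the notebook cell returns immediately."""
    thread = threading.Thread(target=run_fit, args=(commands, applied, steps), daemon=True)
    thread.start()
    return thread


if __name__ == "__main__":
    commands = queue.Queue()
    applied = []
    worker = run_fit_in_background(commands, applied)
    commands.put("clone-modify")  # what a widget callback might enqueue
    worker.join()
    print(applied)
```

Because the cell that starts training returns right away, later cells (or widget callbacks) can push commands while training continues; the loop only picks them up at step boundaries, which keeps the training state consistent.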

Regression Test

Screenshot 2025-10-13 at 12 23 10 PM

Checklist

  • My code follows the project's style guidelines
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published

TODO

  • Verify new test cases work
  • Remove the use of MLflow for the Colab + TensorBoard version (can we also remove it from the install script?)
  • Run the GPT2 notebook end to end
  • Update the notebook to match the Colab test notebook
  • Make sure the GitHub version of the notebook is free of syntax errors
  • Test that no regressions were introduced and that the shutdown script still works on the non-Colab path

nick-rfai and others added 2 commits October 6, 2025 12:42
Add comprehensive CLAUDE.md documentation files to guide AI-assisted
development across all RapidFire modules. These files provide:

- Module overviews and architecture
- File-by-file documentation with purposes and key functions
- Usage examples and common patterns
- Integration points and dependencies
- Best practices and testing guidance

Files added:
- CLAUDE.md (repository root)
- rapidfireai/automl/CLAUDE.md
- rapidfireai/backend/CLAUDE.md
- rapidfireai/db/CLAUDE.md
- rapidfireai/dispatcher/CLAUDE.md
- rapidfireai/ml/CLAUDE.md
- rapidfireai/utils/CLAUDE.md (includes Colab support documentation)

These guidance files help Claude Code (and human developers) understand
the codebase structure and make consistent, informed changes.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@nick-rfai nick-rfai self-assigned this Oct 6, 2025
Collaborator

@arun-rfai arun-rfai left a comment

The steps and the end-to-end demo look good to me. The GPT-2 notebook also looks good. Wait for PS to finish the review before merging.

@pradyumna-rfai pradyumna-rfai merged commit c869783 into main Oct 13, 2025
2 checks passed
@pradyumna-rfai pradyumna-rfai deleted the feat/tensorboard-integration branch October 13, 2025 23:09