Adding CPU training support to AxoNN #39

Open
wants to merge 28 commits into base: develop

Conversation

@Avuxon Avuxon commented Oct 10, 2023

  • Switch to gloo for communication

@Avuxon
Author

Avuxon commented Oct 10, 2023

Need to make the changes conditional on CPUs/GPUs
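
One way the switch could be made conditional, as a minimal sketch rather than the code in this PR: choose the `torch.distributed` backend based on whether CUDA is available. The helper name `pick_backend` is hypothetical; `"nccl"` and `"gloo"` are the standard PyTorch backend names.

```python
def pick_backend(cuda_available: bool) -> str:
    """Return a torch.distributed backend name for the given device type."""
    # NCCL only works with CUDA tensors; gloo is the usual choice on CPUs.
    return "nccl" if cuda_available else "gloo"


if __name__ == "__main__":
    try:
        import torch

        cuda_available = torch.cuda.is_available()
    except ImportError:
        # Without PyTorch installed, fall back to the CPU path for this demo.
        cuda_available = False
    # The result would be passed to
    # torch.distributed.init_process_group(backend=...).
    print(pick_backend(cuda_available))
```

Keeping the decision in one small function means the rest of the communication code does not need to branch on the device type.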

@siddharth9820 siddharth9820 changed the title from "initial cpu changes" to "[WIP} Adding CPU training support to AxoNN" Oct 10, 2023
@siddharth9820 siddharth9820 changed the title from "[WIP} Adding CPU training support to AxoNN" to "[WIP] Adding CPU training support to AxoNN" Oct 10, 2023
@siddharth9820 siddharth9820 added the WIP Work in progress label Oct 10, 2023
@Dando18
Contributor

Dando18 commented Oct 11, 2023

Before merging, this should have:

  • Modular CPU support. The existing code worked for GPU setups, but this change breaks GPU functionality. The library should either take a cpu_only flag to run in CPU mode or auto-detect which mode to use.
  • Tests. These should be able to run on GitHub runners, since they only need CPUs.
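
A rough sketch of how the two requests above could fit together, under assumptions: an explicit `cpu_only` flag takes precedence, and `None` means auto-detect. The names `resolve_cpu_only` and `has_gpu` are hypothetical and not part of AxoNN; the detector is injected so the tests run on a machine without GPUs.

```python
from typing import Callable, Optional


def resolve_cpu_only(cpu_only: Optional[bool], has_gpu: Callable[[], bool]) -> bool:
    """Honor an explicit flag; otherwise fall back to auto-detection."""
    if cpu_only is not None:
        return cpu_only
    # No explicit choice: run in CPU mode only when no GPU is detected.
    return not has_gpu()


def test_explicit_flag_wins():
    # The flag overrides whatever the detector reports.
    assert resolve_cpu_only(True, has_gpu=lambda: True) is True
    assert resolve_cpu_only(False, has_gpu=lambda: False) is False


def test_auto_detect():
    assert resolve_cpu_only(None, has_gpu=lambda: False) is True
    assert resolve_cpu_only(None, has_gpu=lambda: True) is False
```

In real use, `has_gpu` could be `torch.cuda.is_available`. The two test functions follow pytest naming, so a CPU-only CI runner could collect them as-is.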

Dando18 previously requested changes Nov 1, 2023
axonn/communication.py (2 outdated review threads, resolved)
Sathwik Yanamaddi and others added 8 commits April 12, 2024 10:14
Add default values for environment vars

Fixed communication handle flags

Fixed formatting
Formatting fixed

formatting again

tensor-list change

Formatting and Device-Setting

Fixed gpus_per_node access
Co-authored-by: Mahua Singh <mahua04@pssg-mordor.umiacs.umd.edu>
* initialize grad_input to None

* minor
Added missing parameter to test

Fixed formatting

docs: fix build issues and add sub-sections (#69)
Collaborator

@siddharth9820 siddharth9820 left a comment

While attempting to fix the CI, I discovered that the gloo backend doesn't support reduce-scatter. Therefore, as of now, AxoNN won't work on CPUs with G_intra_d > 1. We should add an assert in axonn.py that checks this condition.
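
The check could look roughly like the following. This is a sketch only: the function name `check_cpu_config` and the `cpu_only` parameter are hypothetical, and where the check would sit in axonn.py is not settled here.

```python
def check_cpu_config(G_intra_d: int, cpu_only: bool) -> None:
    """Reject CPU runs that would need a reduce-scatter."""
    # gloo lacks reduce-scatter, so CPU runs require G_intra_d == 1.
    assert not (cpu_only and G_intra_d > 1), (
        "gloo does not support reduce-scatter; "
        f"CPU runs require G_intra_d == 1, got {G_intra_d}"
    )


# GPU runs and CPU runs with G_intra_d == 1 pass silently.
check_cpu_config(G_intra_d=4, cpu_only=False)
check_cpu_config(G_intra_d=1, cpu_only=True)
```

Failing at initialization with an explicit message should be easier to debug than an unsupported-operation error raised deep inside a collective call.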

axonn/communication.py (4 outdated review threads, resolved)
@Avuxon Avuxon requested a review from siddharth9820 May 24, 2024 17:29
@bhatele bhatele changed the title from "[WIP] Adding CPU training support to AxoNN" to "Adding CPU training support to AxoNN" May 24, 2024
@bhatele bhatele added ready-for-review and removed WIP Work in progress labels May 24, 2024
7 participants