Feature/gcp tpu #609

Open: wants to merge 81 commits into base: master

jmikedupont2

Here is my branch of hivemind that works on GCP TPUs.

mryab and others added 30 commits June 20, 2022 16:40
- fix edge case where expert requests with 3.99-4MB payload would fail due to max message size (due to serialization overhead)
- recover from errors in the Runtime, propagate them to the corresponding tasks
   - previously, a failing function would terminate the entire server - which was a major pain for me personally :)
   - failure to process a request will now trigger P2PHandlerError instead of P2PDaemonError (cuz it does not kill the daemon)
- allow optional metadata in ExpertRequest / ExpertResponse for extendability [todo: validate it vs. @mryab ]

Co-authored-by: Max Ryabinin <mryabinin0@gmail.com>
Co-authored-by: Pavel Samygin <samygin@phystech.edu>
(cherry picked from commit ef0b842)
The type of the metadata field in ExpertRequest/ExpertResponse is changed to the more native type `bytes`, and the tests get some compatibility fixes for different `torch` versions.

(cherry picked from commit fe7a4ef)
It is not immediately clear from the documentation that this example cannot run on multiple machines. This PR clarifies this.

(cherry picked from commit ee75b91)
* make DHT ignore SIGINT
* update p2pd version

Co-authored-by: @borzunov
(cherry picked from commit 61e5e8c)
…#494)

* Update README with latest projects and publications

* Reformat the BibTeX entries

Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com>
(cherry picked from commit d42c703)
I think some people are interested in the "Example Use Cases" section because they'd like to know what was already built with hivemind, and other people would like to take a look at the code if they've already started to use hivemind and want some code examples.

Currently, the sahajBERT link leads to the sahajBERT repo that doesn't describe much about the project itself. Conversely, it's hard to find the repo with the code following the CALM and "Training Transformers Together" links.

This PR adds more useful links to each of the projects.

(cherry picked from commit 7a7c93a)
Co-authored-by: Alex <alexandershulga.sh@gmail.com>
(cherry picked from commit bb3aed6)
…earning-at-home#503)

This PR fixes a potential deadlock in hivemind.utils.enter_asynchronously.
This deadlock occurs when many coroutines enter nested locks and exhaust all workers in ThreadPoolExecutor.
In this PR, we mitigate it by creating a dedicated executor for entering locks with no limit to the number of workers.

Co-authored-by: Aleksandr Borzunov <borzunov.alexander@gmail.com>
(cherry picked from commit b02bdad)
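
The fix described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the code from hivemind: the names `_lock_executor`, `enter_lock_async`, `nested` and `demo` are hypothetical, and a very large `max_workers` stands in for "no limit". The point is that lock acquisition runs in its own pool, so coroutines waiting on an outer lock cannot starve the one that needs a worker to take the inner lock.

```python
import asyncio
import threading
from concurrent.futures import ThreadPoolExecutor
from contextlib import asynccontextmanager

# Assumption: a dedicated pool used only for entering locks. The large
# max_workers value is an illustrative stand-in for "no limit".
_lock_executor = ThreadPoolExecutor(max_workers=10_000)


@asynccontextmanager
async def enter_lock_async(lock: threading.Lock):
    """Acquire a blocking lock in a worker thread so the event loop stays free."""
    loop = asyncio.get_running_loop()
    await loop.run_in_executor(_lock_executor, lock.acquire)
    try:
        yield
    finally:
        # threading.Lock may be released from a thread other than the acquirer.
        lock.release()


async def nested(outer, inner, results, i):
    # With a small shared pool, waiters on `outer` could occupy every worker
    # while the current holder of `outer` still needs one to acquire `inner`.
    async with enter_lock_async(outer):
        async with enter_lock_async(inner):
            results.append(i)


async def demo(n=8):
    outer, inner = threading.Lock(), threading.Lock()
    results = []
    await asyncio.gather(*(nested(outer, inner, results, i) for i in range(n)))
    return sorted(results)
```

Calling `asyncio.run(demo())` runs all coroutines to completion; shrinking the pool to a single worker is an easy way to see the deadlock the commit describes.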
…home#506)

The TaskPoolBase interface currently requires iterate_minibatches to be implemented. However, this method is not called by anything except TaskPool (internally). Runtime actually calls load_batch_to_runtime. This PR changes the interface to reflect that.

While we're at it, I've also changed the prefetch generator so that it actually does not prefetch batches when prefetch_batches = 0. Previously, 0 would silently mean "unlimited".

Co-authored-by: Max Ryabinin <mryabinin0@gmail.com>
(cherry picked from commit 41587e4)
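
To illustrate the `prefetch_batches = 0` behavior from the commit above, here is a small sketch of a prefetching generator. It is not hivemind's implementation; `prefetch` and `_DONE` are hypothetical names. One plausible way for 0 to end up meaning "unlimited" is passing it straight to `queue.Queue`, where `maxsize <= 0` means an unbounded queue, so the sketch handles that case explicitly.

```python
import queue
import threading
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")
_DONE = object()  # sentinel that marks the end of the source


def prefetch(source: Iterable[T], prefetch_batches: int) -> Iterator[T]:
    """Yield items from `source`, reading at most `prefetch_batches` items ahead."""
    if prefetch_batches <= 0:
        # No background thread: each item is read only when it is requested.
        yield from source
        return

    buffer = queue.Queue(maxsize=prefetch_batches)

    def worker():
        for item in source:
            buffer.put(item)  # blocks once the buffer is full
        buffer.put(_DONE)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = buffer.get()
        if item is _DONE:
            return
        yield item
```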
…iority (learning-at-home#505)

Currently, the priority is set to the timestamp of the earliest undispatched task.
Choosing the earliest tasks reduces the maximum waiting time when the queue is nonempty.

Co-authored-by: Max Ryabinin <mryabinin0@gmail.com>
Co-authored-by: Pavel Samygin <44449246+greenfatguy@users.noreply.github.com>
(cherry picked from commit 6395e89)
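
The priority rule from the commit above can be sketched as follows. This is a simplified model under stated assumptions: `SketchPool` and `pick_next_pool` are hypothetical names, tasks are kept in arrival order, and an empty pool gets an infinite priority value so that it is never chosen ahead of a pool with pending work.

```python
import time
from collections import deque
from dataclasses import dataclass, field


@dataclass
class SketchPool:
    """Hypothetical stand-in for a task pool; tasks are stored in arrival order."""
    name: str
    tasks: deque = field(default_factory=deque)  # items are (timestamp, payload)

    def submit(self, payload, timestamp=None):
        self.tasks.append((time.time() if timestamp is None else timestamp, payload))

    @property
    def priority(self):
        # Timestamp of the earliest undispatched task; smaller means older.
        return self.tasks[0][0] if self.tasks else float("inf")


def pick_next_pool(pools):
    """Serve the pool whose oldest task has waited the longest."""
    return min(pools, key=lambda pool: pool.priority)
```

Picking the smallest timestamp means the task that has waited longest is always served next, which is what bounds the maximum waiting time.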
* Add support for quantization with bitsandbytes

* Extend the compression benchmark

* Add a test for blockwise compression

* Add a note to README about bitsandbytes

* Install bitsandbytes in tests as well

* Verify outputs consistently in test_moe.py
(to make the test less flaky)

* Pass device="cpu" in test_background_server_identity_path
This ensures that the server can actually launch in a GPU-enabled environment: otherwise initializing the CUDA context in a parent process prevents it

* Filter bitsandbytes warnings

(cherry picked from commit 131f82c)
forbid protobuf 4.x for now

(cherry picked from commit e9f35b5)
While using scripts built with hivemind, users often run two peers with the same identity by accident (e.g., if they forget to change the CLI command or copy the same identity file to another host via `scp`). Currently, this leads to undefined behavior in libp2p.

This PR makes `hivemind.P2P` check if the identity is already taken, thus solving this issue in all applications at once.

(cherry picked from commit 64a6c30)
…xes (learning-at-home#513)

- In `hivemind.Server`, use the graceful shutdown for `ConnectionHandler`
- In `hivemind.P2P`, if we are the first peer, skip checking if the provided identity is free

(cherry picked from commit 13cdd13)
* Update bitsandbytes, relax its version constraint

(cherry picked from commit 44d9569)
Fixed the broken link in the tutorial.

(cherry picked from commit 3e817a5)
…home#517)

Currently, one may sometimes get the "unable to open shared memory" error (see the screenshot) while using `hivemind.MPFuture`. Interestingly, the smaller `HIVEMIND_SHM_BUFFER_SIZE` is, the more often the error occurs (e.g., in Petals, it occurs right after starting the server if `HIVEMIND_SHM_BUFFER_SIZE=2`).

It turns out that this happens when the origin process garbage-collects all instances of MPFuture that use the same shmem buffer: the underlying buffer is then freed, and target processes can no longer reconnect to it when unpickling their instances of MPFuture.

This PR fixes this important issue.

(cherry picked from commit 94c985d)
This is necessary for learning-at-home#521 to work. The minimal version where `torch.inference_mode()` works is 1.9.0.

(cherry picked from commit 1242cfb)
Before this PR, the P2P daemon was often killed after `idle_timeout` even if a persistent connection was open. This was caused by a concurrency bug in go-libp2p-daemon that was just fixed: learning-at-home/go-libp2p-daemon#21

(cherry picked from commit 8d51b97)
This PR implements bfloat16 support for `CompressionType.NONE` and `CompressionType.BLOCKWISE_8BIT`.

This is important for the Petals client, see bigscience-workshop/petals#79

(cherry picked from commit 1e4af43)
…e#525)

Before this PR, hivemind-dht-based initial peers collected lots of stale PeerIDs and other peers could not actually make DHT queries anymore.

(cherry picked from commit be88b42)
This version contains relevant changes that improve how libp2p relays work, see learning-at-home/go-libp2p-daemon#22.

Co-authored-by: Pavel Samygin <44449246+greenfatguy@users.noreply.github.com>
(cherry picked from commit 4c167fa)
justheuristic and others added 30 commits March 31, 2023 16:55
- Fix LRSchedulerBase
- Handle None after .zero_grad() in torch 2.0.0
- Use set_to_none=True by default in torch>=2.0
- Add set_to_none param to TrainingStateAverager.step()

Co-authored-by: Aleksandr Borzunov <hxrussia@gmail.com>
(cherry picked from commit 98531ce)
…e#561)

Previously, `RemoteExpertWorker` ran one coroutine at a time, so hivemind.moe/Petals clients were very slow for concurrent calls.

(cherry picked from commit 589cb2c)
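
The difference between running one coroutine at a time and running them concurrently can be sketched with a background event loop. This is an illustration only, not the actual `RemoteExpertWorker` code: `BackgroundLoop`, `fake_call` and `run_concurrently` are hypothetical names, and `asyncio.sleep` stands in for a network round trip.

```python
import asyncio
import threading
import time


class BackgroundLoop:
    """Hypothetical worker: one event loop in a background thread."""

    def __init__(self):
        self.loop = asyncio.new_event_loop()
        self.thread = threading.Thread(target=self.loop.run_forever, daemon=True)
        self.thread.start()

    def submit(self, coro):
        # Schedules the coroutine and returns a concurrent.futures.Future
        # immediately, so several calls can be in flight at once.
        return asyncio.run_coroutine_threadsafe(coro, self.loop)

    def shutdown(self):
        self.loop.call_soon_threadsafe(self.loop.stop)
        self.thread.join()
        self.loop.close()


async def fake_call(i, delay=0.2):
    await asyncio.sleep(delay)  # stand-in for a remote request
    return i * i


def run_concurrently(n=5):
    worker = BackgroundLoop()
    start = time.perf_counter()
    futures = [worker.submit(fake_call(i)) for i in range(n)]  # submit all first
    results = [future.result() for future in futures]          # then wait
    elapsed = time.perf_counter() - start
    worker.shutdown()
    return results, elapsed
```

Submitting every call before waiting on any result is what lets the delays overlap; waiting on each future right after submitting it would serialize the calls again.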
…#565)

This PR:

1. Fixes warnings in hivemind.p2p destructors.

2. Makes bfloat16 serialization in hivemind.compression forward- and backward-compatible. The code before this PR (a) didn't work in torch < 1.13.0 (hivemind requires torch >= 1.9.0) and (b) led to warnings on torch >= 2.0. The new code works without warnings in all versions of PyTorch.

(cherry picked from commit 0d2614d)
Pydantic 2.0 was released yesterday and is not compatible with the current code.

(cherry picked from commit b7cbd97)
…-home#587)

* allow overriding args/kwargs in Runtime
* switch stats time to time.perf_counter

---------

Co-authored-by: Your Name <you@example.com>
Co-authored-by: Max Ryabinin <mryabinin0@gmail.com>
(cherry picked from commit 33a9a41)
This doesn't change anything on Linux but helps macOS users. Specifically, it helps to:

- Avoid [this error](bigscience-workshop/petals#405 (comment)) for people who don't use `if __name__ == "__main__"` in simple scripts on macOS (that uses spawn for processes by default).
- Make DHT consistent with other code that inherits from `mp.context.ForkProcess` directly.

(cherry picked from commit 1eb5d18)
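
As a rough illustration of the commit above, a process class can inherit from `mp.context.ForkProcess` so that it always uses the fork start method, whatever the platform default is. This sketch is Unix-only, and `ForkedWorker` and `run_forked_worker` are hypothetical names, not hivemind classes.

```python
import multiprocessing as mp


class ForkedWorker(mp.context.ForkProcess):
    """Always started with fork, even where spawn is the default (e.g. macOS)."""

    def __init__(self, conn):
        super().__init__()
        self.conn = conn

    def run(self):
        # Runs in the child process.
        self.conn.send("hello from child")
        self.conn.close()


def run_forked_worker():
    parent_conn, child_conn = mp.Pipe()
    worker = ForkedWorker(child_conn)
    worker.start()
    message = parent_conn.recv()
    worker.join()
    return message, worker.exitcode
```

With fork, the child does not re-import the main module, so a simple script does not need the `if __name__ == "__main__"` guard for this to work.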
…g-at-home#588)

This PR makes hivemind use a separate p2pd binary for each `(os, platform)`, so:

- Now we download a 2x smaller binary for a specific macOS arch, instead of downloading the large universal binary
- Now we also provide a `p2pd-linux-arm64` binary (maybe someone wants to run a DHT node on Raspberry Pi?)

(cherry picked from commit 27318f9)
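
Selecting a binary per `(os, platform)` pair might look like the sketch below. Only the name `p2pd-linux-arm64` comes from the commit message; the function `p2pd_binary_name`, the alias table and the other resulting names are assumptions made for illustration, not the project's actual setup code.

```python
import platform

# Assumption: map common machine names to the architecture suffix of the binary.
_ARCH_ALIASES = {
    "x86_64": "amd64",
    "amd64": "amd64",
    "aarch64": "arm64",
    "arm64": "arm64",
}


def p2pd_binary_name(system=None, machine=None):
    """Build a binary name such as 'p2pd-linux-arm64' for the given platform."""
    system = (system or platform.system()).lower()
    machine = (machine or platform.machine()).lower()
    if system not in ("linux", "darwin") or machine not in _ARCH_ALIASES:
        raise RuntimeError(f"no prebuilt binary assumed for {system}/{machine}")
    return f"p2pd-{system}-{_ARCH_ALIASES[machine]}"
```

Called without arguments, the function detects the current machine; passing `system` and `machine` explicitly makes it easy to check other platforms.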
* serialize with requires_grad
* ensure that all compression methods return tensor of the original dtype
* test that all compression methods preserve dtype and requires_grad


---------

Co-authored-by: Your Name <you@example.com>
Co-authored-by: Max Ryabinin <mryabinin0@gmail.com>
…g-at-home#595)

* Install setuptools+wheel in develop mode during CI

* Fix deprecations and update dependencies for examples/albert
* Bump p2pd version
* Bump multiaddr
* Remove pymultihash