Checkpoint removal 2 #250
base: main
Conversation
@adamlerer @lw this is super hacky at the moment. I'll add some config options, but I wanted to get your feedback on this first before I went too far down a rabbit hole.
This change seems conceptually good to me. Of course, the code looks incorrect for the distributed case, but I assume you will fix that up for the final version.
We will probably need to test this change with some distributed runs, unless you can make it very clearly correct "by inspection".
self.checkpoint_manager.write(
    entity, part, embs.detach(), optimizer.state_dict()
)
self._write_single_embedding(holder, entity, part)
self.embedding_storage_freelist[entity].add(embs.storage())
io_bytes += embs.numel() * embs.element_size()  # ignore optim state
# these variables are holding large objects; let them be freed
del embs
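(The body of _write_single_embedding is not shown in this excerpt. As a rough guess at the shape of the refactor, assuming the helper looks up the tensors itself and delegates to the existing checkpoint manager, it might look something like the sketch below; the attribute names are illustrative, not the PR's actual code.)

def _write_single_embedding(self, holder, entity, part):
    # Illustrative guess: fetch the embedding tensor and its optimizer for
    # this (entity, partition) pair from the holder / trainer state ...
    embs = holder.partitioned_embeddings[entity, part]
    optimizer = self.trainer.partitioned_optimizers[entity, part]
    # ... and delegate to the checkpoint manager, as the removed inline
    # code used to do.
    self.checkpoint_manager.write(
        entity, part, embs.detach(), optimizer.state_dict()
    )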
How do these lines work if you don't define embs and optimizer any more?
for entity, part in parts:
    self._write_single_embedding(self.holder, entity, part)

def _write_stats(self, bucket: Optional[Bucket], stats: Optional[BucketStats]):
This naming is misleading... I think it does more than write stats in the distributed scheduler.
Even if it's "clearly correct by inspection", I'd still feel better if you all were able to run some distributed versions as well.
Types of changes
Motivation and Context / Related issue
With 1bn entities, dumping the embedding table and reloading it was taking as long as training on a single 10bn-edge chunk. This PR reduces the amount of checkpointing to save that cost, which drastically speeds up single-instance embedding.
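The author mentions adding config options for this. Purely as an illustration of the idea, not this PR's actual code (the flag name and helper below are invented), reducing checkpoint frequency could look roughly like:

from typing import Callable, Dict, Tuple

import torch


def maybe_write_embeddings(
    embeddings: Dict[Tuple[str, int], torch.Tensor],
    write: Callable[[str, int, torch.Tensor], None],
    bucket_idx: int,
    checkpoint_every_n_buckets: int,
) -> None:
    # Skip the expensive dump-and-reload for most buckets; with 1bn entities
    # that I/O can rival training on a 10bn-edge chunk.
    if checkpoint_every_n_buckets <= 0:
        return  # never write intermediate checkpoints
    if (bucket_idx + 1) % checkpoint_every_n_buckets != 0:
        return  # keep embeddings in memory for now
    for (entity, part), embs in embeddings.items():
        write(entity, part, embs.detach())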
How Has This Been Tested (if it applies)
Verified that this worked at runtime
Checklist