[DRAFT] ppo chess with llm and ConditionalPolicySwitch to sunfish bot #2763
Conversation
        hidden = output.hidden_states[-1][:, input_length - 1, :]
        return log_prob, hidden
    else:
        while True:
When collecting data, I do the following:
(1) The LLM input is something like the following, tokenized:
You are playing a game of chess. The list of moves so far are [<start>, Nf3, Nh6, Nc3] and the legal moves are [Rg8, Nc6, Na6, Ng8, Nf5, Ng4, g6, f6, e6, d6, c6, b6, a6, g5, f5, e5, d5, c5, b5, a5]. Please choose one of the legal moves. Respond only with the following sentence, with no additional explanatory text. Example Answer: I choose Rg8!
(2) I generate a maximum of 7 new tokens in a loop (using argmax over the logits to sample each token), breaking if any of them is ! (this is supposed to be the end of the sentence)
(3) I use a regex to verify the format and that the chosen move is legal
(4) I repeat (2) and (3) until a valid move is chosen
(5) I pad output_tokens to length 7
The reasoning for (5) is that I need to make sure that, during the PPO loss computation, the distribution dist has the same dimensions as action.
Does all of the above sound reasonable, or is there something better I could be doing here?
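A minimal sketch of what steps (2)-(5) could look like (an illustrative reconstruction, not code from this PR; a Hugging Face-style causal LM and tokenizer are assumed, and sample_move and MOVE_RE are made-up names):

import re
import torch

MAX_NEW_TOKENS = 7
MOVE_RE = re.compile(r"I choose (\S+)!")

def sample_move(model, tokenizer, prompt_ids, legal_moves, pad_token_id):
    # (4) repeat until the model produces a well-formed, legal move
    while True:
        generated = []
        ids = prompt_ids
        # (2) greedily generate up to 7 tokens, stopping on "!"
        for _ in range(MAX_NEW_TOKENS):
            logits = model(input_ids=ids).logits[:, -1, :]
            tok = logits.argmax(dim=-1, keepdim=True)
            generated.append(tok)
            ids = torch.cat([ids, tok], dim=-1)
            if "!" in tokenizer.decode(tok[0]):
                break
        out = torch.cat(generated, dim=-1)
        # (3) regex-check the format and the legality of the chosen move
        m = MOVE_RE.fullmatch(tokenizer.decode(out[0]).strip())
        if m is not None and m.group(1) in legal_moves:
            break
    # (5) pad to a fixed length so dist and action have matching dims in the PPO loss
    pad = torch.full((1, MAX_NEW_TOKENS - out.shape[-1]), pad_token_id,
                     dtype=out.dtype, device=out.device)
    return torch.cat([out, pad], dim=-1)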
Yeah, I think it's good!
The while loop for generation is a bit worrisome, though, because then the samples are not really drawn from the LLM's distribution.
One way around this could be to terminate the game if the output is not valid and assign a losing reward (like -1 or something).
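A rough sketch of that idea (illustrative only; parse_move, decoded_text, and the key names are assumptions, not code from this PR):

import torch

# Instead of re-sampling in a loop, treat an invalid output as a loss:
move = parse_move(decoded_text, legal_moves)  # hypothetical parser; returns None if invalid
if move is None:
    td.set(("next", "reward"), torch.tensor([-1.0]))  # losing reward for the LLM
    td.set(("next", "done"), torch.tensor([True]))    # terminate the game
else:
    td.set("action", move)
    td = env.step(td)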
examples/agents/ppo-chess-llm.py
actor_llm_policy = ProbSeq(
    Mod(
        LLMWrapper(llm, tokenizer, mode="policy"),
        in_keys=["obs_tokens"],
        out_keys=["logits", "hidden"],
    ),
    prob_module,
    # a bare lambda fails here: 'function' object has no attribute 'in_keys'
    Mod(AggregateProb(), in_keys=["sample_log_prob"], out_keys=["sample_log_prob"]),
    return_composite=True,
)
Does the way data_llm_policy and actor_llm_policy are defined make sense here?
    return tensor1_padded, tensor2_padded

for data in tqdm(collector):
    # FIXME: reward seems to be getting wrongly propagated (e.g. sunfish's win gets reflected as llm's win)
Debugging this one; I'm not sure whether it's an issue with how I applied the transforms.
examples/agents/ppo-chess-llm.py
# layout=torch.jagged errors with Qwen
#   File "/home/mg1998/.conda/envs/rl/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 859, in forward
#     cache_position = torch.arange(
#         past_seen_tokens, past_seen_tokens + inputs_embeds.shape[1], device=inputs_embeds.device
#     )
# AttributeError: 'ConstantIntNode' object has no attribute 'add'
The takeaway here for me is that we can't generically expect to use NJT for the LLM's input_tokens unless the LLM's forward is written in an "NJT-friendly" manner, without queries to sizes or implicit assumptions about dense input. In this case, Qwen makes some such assumptions.
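For illustration, a dense-padding fallback that sidesteps the jagged layout (toy shapes; the model call is commented out):

import torch
from torch.nn.utils.rnn import pad_sequence

seqs = [torch.tensor([1, 2, 3]), torch.tensor([4, 5])]  # per-sample token ids

# The jagged layout that trips Qwen's size queries in forward():
# njt = torch.nested.nested_tensor(seqs, layout=torch.jagged)

# Dense fallback: right-pad and pass an attention mask instead.
padded = pad_sequence(seqs, batch_first=True, padding_value=0)
mask = (padded != 0).long()
# out = model(input_ids=padded, attention_mask=mask)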
    return tensordict_reset


class Score(Transform):
Trying this instead of a randomly initialized critic head.
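For context, a minimal sketch of what such a transform could look like, assuming a plain material-count score and a "fen" string observation (the actual scoring in this PR may differ):

import chess
import torch
from torchrl.envs.transforms import Transform

PIECE_VALUES = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
                chess.ROOK: 5, chess.QUEEN: 9}

class Score(Transform):
    # Writes a material-balance score to use in place of a randomly
    # initialized critic head. Sketch only; assumes the env exposes "fen".
    def _call(self, tensordict):
        board = chess.Board(tensordict["fen"])
        score = sum(
            v * (len(board.pieces(p, chess.WHITE)) - len(board.pieces(p, chess.BLACK)))
            for p, v in PIECE_VALUES.items()
        )
        tensordict.set("state_value", torch.tensor([float(score)]))
        return tensordict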
    return observation_spec


def run_player(input_queue, output_queue):
Oh wow, that's clever! I didn't think it would take that to use sunfish.
Do you think we should upstream it into ChessEnv?
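Presumably the pattern is along these lines (an illustrative reconstruction; compute_sunfish_move is a hypothetical helper, not the PR's code):

import multiprocessing as mp

def run_player(input_queue, output_queue):
    # Run the sunfish engine in a separate process, talking over queues.
    while True:
        position = input_queue.get()
        if position is None:  # sentinel to shut the worker down
            break
        output_queue.put(compute_sunfish_move(position))  # hypothetical helper

in_q, out_q = mp.Queue(), mp.Queue()
proc = mp.Process(target=run_player, args=(in_q, out_q), daemon=True)
proc.start()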
for data in tqdm(rb):
    data = gae(data)
gae should go above the tqdm(rb).
You need the data to be presented sequentially to apply GAE - even slices may not be ideal (it's better to compute it once and for all than at each iteration).
Also, I would put a no_grad around it for safety.
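Concretely, something like this (sketch, using the names already in the script):

with torch.no_grad():  # advantage estimation needs no gradients
    data = gae(data)   # compute GAE once, on sequentially ordered data
rb.extend(data)        # then sample minibatches from the buffer
for d in tqdm(rb):
    ...                # loss/optimization as before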
This was something I wanted to talk about! If GAE happens before the data goes into the replay buffer, we have no way to nicely append_transform it to the collector (I think).
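One possibility, if it fits here: collectors accept a postproc callable that is applied to each batch before it is yielded, which could host the GAE call (sketch; env, policy, and frames_per_batch stand in for the script's actual arguments):

from torchrl.collectors import SyncDataCollector

collector = SyncDataCollector(
    env,
    policy,
    frames_per_batch=frames_per_batch,
    postproc=gae,  # runs on each sequential batch as it is collected
)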
rb = ReplayBuffer(
    storage=LazyStackStorage(100),
    batch_size=48,
    sampler=SliceSamplerWithoutReplacement(slice_len=8, end_key=("next", "done")),
In theory we may not need a slice sampler here.
    shifted=True,
)

for data in tqdm(collector):
Usually there are 3 nested loops:

for data in collector:
    for n in range(n_epochs):
        rb.extend(gae(data.copy()))
        for d in rb:
            loss(d)  # etc.