Add support for a simple preference UI #712
Conversation
src/imitation/util/video_wrapper.py
Outdated
```python
self.video_recorder.capture_frame()
return res
# is it crazy to save the video path at every step?
info["video_path"] = self.get_current_video_path()
```
Probably best to only store this on the first step
The fragmenter will splice trajectories into multiple fragments, so storing it once per episode may not be enough (unless you make the fragmenter cleverer and have it propagate the video path as needed). You can avoid extra memory by making sure `self.get_current_video_path()` returns the same object on successive calls if the underlying path hasn't changed (some kind of memoization should work).
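To illustrate the memoization idea (a minimal sketch, not the PR's actual implementation — the class and attribute names here are hypothetical):

```python
from pathlib import Path
from typing import Optional


class VideoPathCache:
    """Sketch: return the *same* Path object on successive calls while
    the underlying path string is unchanged, so repeated storage in
    step infos shares one object instead of many equal copies."""

    def __init__(self) -> None:
        self._cached: Optional[Path] = None

    def get_current_video_path(self, raw_path: str) -> Path:
        # Only allocate a new Path when the underlying string changes.
        if self._cached is None or str(self._cached) != raw_path:
            self._cached = Path(raw_path)
        return self._cached
```

With this, every `info["video_path"]` within one recording would hold a reference to a single shared object.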
I think the issue I was worried about was that the trajectories would be stored on disk, in which case the video path would be duplicated a lot. Although maybe pickle is smart enough to make it a reference if it's the same object? IIRC `str` is immutable, so it probably gets stored on disk as a value, but `Path` probably won't be.
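For what it's worth, a quick standalone check (not from the PR) of pickle's behavior: within a single dump, pickle memoizes objects by identity, so a repeated object is written once and restored as shared references, while equal-but-distinct objects are each serialized in full.

```python
import pickle
from pathlib import Path

# The same Path object stored twice: pickle writes it once and the
# second slot becomes a back-reference, so identity survives a round trip.
p = Path("/tmp/video.mp4")
restored = pickle.loads(pickle.dumps([p, p]))
assert restored[0] is restored[1]

# Two equal but *distinct* Path objects are each serialized in full,
# so they come back as two separate objects.
q = [Path("/tmp/video.mp4"), Path("/tmp/video.mp4")]
restored_q = pickle.loads(pickle.dumps(q))
assert restored_q[0] is not restored_q[1]
```

So returning the same object from `get_current_video_path()` would indeed keep the on-disk size down.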
Thanks for implementing this! Looks like the implementation is moving in the right direction. Agreed we should add a test case and a notebook demo. I've left some relatively minor comments on the implementation so far.
""" | ||
|
||
preferences = np.zeros(len(fragment_pairs), dtype=np.float32) | ||
for i, (frag1, frag2) in enumerate(fragment_pairs): |
Could also write as something like (not tested, may need to be modified):

```python
preferences = [self._display_videos(*frag) for frag in fragment_pairs]
preferences = np.asarray(preferences, dtype=np.float32)
```

Not sure that's actually any better though -- more succinct but also a bit more magical. WDYT Tom & @levmckinney?
I don't love using a list comprehension for UI, because it obfuscates the fact that it's going to be slow and synchronous. A list comprehension feels functional to me, which would imply that the order of operations doesn't matter.
```python
    raise ValueError(f"Unexpected input {pref}")
else:
    print("Which video is preferred? (1 or 2, or q to quit, or r to replay):\n")
    cap1 = cv2.VideoCapture(str(frag1_video_path))
```
The IPython and CV sections are largely independent except for the initial path extraction -- consider moving this whole `else:` branch into a helper function analogous to `self._display_videos_in_notebook`?
Yeah, I'll do that. Although the way I built it, `_display_videos_in_notebook` only displays the videos, while the display and interaction code isn't cleanly separated in this code. (I could refactor to make it more modular and give both the same shape, but that feels like overkill.)
```python
        ret1, frame1 = cap1.read()
        ret2, frame2 = cap2.read()
    elif key == "1" or key == "2":
        cap1.release()
```
Do we always want these lines to be executed? If so, might want to move them to a `finally:` block to both ensure they always execute and cut the duplication of lines 1026-1028.
Maybe I'm wrong, but based on Googling it looks like `finally` can only be used with try/except blocks? Were you thinking I'd add a dummy try/except here?
Or for the KeyboardInterrupt case? TBH I think I don't need to run these functions in the KeyboardInterrupt case, because it's going to kill the program anyway. EDIT: I guess it's possible the KeyboardInterrupt is caught elsewhere, so never mind.
You don't need an `except:`; it can just be:

```python
try:
    # your regular code
finally:
    # code to always run
```

Not that important in this case, but yes, in general a variety of exceptions could occur (including KeyboardInterrupt) and be caught elsewhere in the code.
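To make the pattern concrete (a generic, self-contained sketch — `FakeCapture` is a stand-in for `cv2.VideoCapture`, not the PR's actual code): cleanup in a `finally` block runs whether the function exits via `return` or via an exception such as `KeyboardInterrupt`.

```python
class FakeCapture:
    """Stand-in for cv2.VideoCapture so the sketch is self-contained."""

    def __init__(self) -> None:
        self.released = False

    def release(self) -> None:
        self.released = True


def gather_preference(cap1: FakeCapture, cap2: FakeCapture, key: str) -> str:
    try:
        if key == "q":
            raise KeyboardInterrupt  # simulate the user aborting
        return key  # "1" or "2"
    finally:
        # Runs on normal return *and* when an exception propagates,
        # so both captures are released on every exit path.
        cap1.release()
        cap2.release()
```

Even when the `KeyboardInterrupt` propagates out of `gather_preference`, both captures have been released by the time a caller catches it.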
FWIW I haven't cleaned this up at all, so it's not quite ready for review - I probably should have included "[WIP]" in the title.
```python
print("Which video is preferred? (1 or 2, or q to quit, or r to replay):\n")
cap1 = cv2.VideoCapture(str(frag1_video_path))
cap2 = cv2.VideoCapture(str(frag2_video_path))
cv2.namedWindow("Video 1", cv2.WINDOW_NORMAL)
```
Is "video" the right word here? "Trajectory"? "Demonstration"?
Force-pushed from bd31608 to f176bdc
```python
@@ -906,6 +908,201 @@ def _reward_sums(self, fragment_pairs) -> Tuple[np.ndarray, np.ndarray]:
    return np.array(rews1, dtype=np.float32), np.array(rews2, dtype=np.float32)


class SynchronousCLIGatherer(PreferenceGatherer):
```
It occurs to me that this isn't very testable right now
I'm going to leave this untested for now, since I have other tasks I need to attend to. 🤞 that I'll get to it later, but I'm not sure I will.
Mostly looks good; some minor changes needed, plus the addition of a test case.
@levmckinney reckon you could take this PR over the finish line?
" rng=rng,\n", | ||
")\n", | ||
"\n", | ||
"# This gatherer will show the user (you!) pairs of trajectories and ask it to choose\n", |
"# This gatherer will show the user (you!) pairs of trajectories and ask it to choose\n", | |
"# This gatherer will show the user (you!) pairs of trajectories and ask them to choose\n", |
Most people's pronouns aren't "it" ;)
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"From this point onward, this notebook is the same as [the synthetic gatherer notebook](5_train_preference_comparisons.ipynb).\n", |
Consider adding a comment in `5_train_preference_comparisons.ipynb` to this effect, so that we don't inadvertently make edits in that file that cause the two to drift out of sync?
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Now we can train an agent, that only sees those learned reward." |
"Now we can train an agent, that only sees those learned reward." | |
"Now we can train an agent, that only sees those learned rewards." |
```
Args:
    video_dir: directory where videos of the trajectories are saved.
    video_width: width of the video in pixels.
```
I think this and `video_height` only have an effect when not running in IPython. Consider either documenting this or changing it to control the width/height in IPython too? (Or perhaps we can follow the IPython default and have the CV windows' width/height default to the size of the recorded videos.)
```python
def _in_ipython(self) -> bool:
    try:
        return self.is_running_pytest_test() or get_ipython().__class__.__name__ == "ZMQInteractiveShell"  # type: ignore[name-defined]  # noqa
```
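For context, a standalone version of this detection pattern (a sketch; the PR's `is_running_pytest_test` helper is omitted here, so this only shows the IPython half):

```python
def in_ipython_notebook() -> bool:
    """Return True when running inside a Jupyter/IPython kernel.

    get_ipython is injected into the namespace by IPython; in a plain
    Python process it does not exist, so NameError means "not in IPython".
    """
    try:
        shell = get_ipython().__class__.__name__  # type: ignore[name-defined]
        return shell == "ZMQInteractiveShell"  # Jupyter kernel
    except NameError:
        return False
```

In a plain interpreter (or under pytest without IPython) this returns `False`, which is what makes the OpenCV branch reachable in normal runs.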
Does this mean we never test `_display_in_windows`?
```python
@@ -166,7 +169,18 @@ def train_preference_comparisons(

    custom_logger, log_dir = logging_ingredient.setup_logging()

    with environment.make_venv() as venv:
        post_wrappers = (
```
The long expression is a bit hard to follow; perhaps:

```python
post_wrappers = None
if video_log_dir is not None:
    post_wrappers = [...]
```
```python
        True,
        np.array([1.0]),
    ),
    types.TrajectoryWithRew(
```
Is this identical to the one above? Can we assign it to a temporary variable, e.g. `traj`, and then rewrite it as `trajectory_pairs = [(traj, traj)]`?
```python
    ),
]

# this is the actual test
```
Can we test the "r" or "q" functionality somehow?
```python
mock_input.return_value = "2"
assert gatherer(trajectory_pairs) == np.array([0.0])
```
TODO: test the functionality using `self._display_in_windows`, not just IPython? (Probably only worth doing if it's relatively easy, e.g. ~1 hour of work; otherwise we can just take the hit in code coverage.)
I did a general code review of the library changes and left a few comments. A few more notes:
- Right after running the tutorial, this is what I see:
[screenshot: rendering windows opened while running the tutorial]
It's probably because we're still running on gym==0.21.0, which uses pyglet for rendering, which is a ton of pain. Gymnasium uses pygame, so it might be easier to make headless, but I'm not 100% sure right now.
If we could remove the individual rendering for the learning envs, that would be great. The one way I see to do this is an env without rendering for learning, and then a separate one for rendering -- would this be easy to do? I think if it works, it'd be a good improvement to the user experience of the tutorial.
- Maybe an option to skip a given sample? (In the 1-2-r-q choice, add `s` for skip.) Sometimes it's so hard to tell that it feels better to not choose anything; not sure how well this plays with the algorithm (it should be alright to just get another sample, or remove that one from the dataset).
- If possible, maybe put the videos side by side.

This is in order of importance, and all of them are mainly just QoL improvements, depending on how much we want to spend on this tutorial. On the other hand, if we (or someone else) intend to use the UI for larger real workloads, they become somewhat more significant.
Also, for the sake of the tutorial, it'd be good to have some more commentary around each created object -- what's a reward net, what's a fragmenter, a preference model, etc.
```python
def step(self, action):
    res = self.env.step(action)
    obs, rew, done, info = self.env.step(action)
```
This will cause an issue during the switch to Gymnasium, which has a 5-element `step` return.
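For illustration, one way to insulate the wrapper from the API change (a sketch under the assumption that the wrapper only needs obs/reward/done/info; this is not the PR's code):

```python
def unpack_step(result):
    """Normalize env.step() output to (obs, reward, done, info).

    Old gym returns (obs, reward, done, info); Gymnasium returns
    (obs, reward, terminated, truncated, info).
    """
    if len(result) == 5:  # Gymnasium API
        obs, reward, terminated, truncated, info = result
        return obs, reward, terminated or truncated, info
    obs, reward, done, info = result  # legacy gym API
    return obs, reward, done, info
```

Collapsing `terminated`/`truncated` into a single `done` loses the distinction between the two end-of-episode reasons, so this is only appropriate where the wrapper genuinely doesn't care which one occurred.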
```python
def close(self) -> None:
    if self.video_recorder is not None:
        self.video_recorder.close()
        self.video_recorder = None
    if self.delete_on_close:
        for path in self.directory.glob("*.mp4"):
            path.unlink()
```
This might cause an unlikely failure case if one of the videos is deleted during the iteration; it might be good to include a try-except clause.
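A minimal sketch of the tolerant deletion being suggested (hypothetical free function standing in for the wrapper's `close`; on Python ≥ 3.8 `path.unlink(missing_ok=True)` would work too):

```python
import tempfile
from pathlib import Path


def delete_videos(directory: Path) -> int:
    """Delete all .mp4 files, ignoring ones that vanish mid-iteration."""
    deleted = 0
    for path in directory.glob("*.mp4"):
        try:
            path.unlink()
            deleted += 1
        except FileNotFoundError:
            # Another process (or an earlier close()) already removed it.
            pass
    return deleted


# Usage: deleting twice is safe; the second pass finds nothing to remove.
with tempfile.TemporaryDirectory() as tmp:
    d = Path(tmp)
    (d / "a.mp4").touch()
    (d / "b.mp4").touch()
    assert delete_videos(d) == 2
    assert delete_videos(d) == 0
```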
One more thing I noticed -- I don't think this will delete the directories created along the way? But we need to be careful in case someone points it at "." and that then gets deleted upon closing.
```python
@patch("builtins.input")
@patch("IPython.display.display")
def test_synchronous_human_gatherer(mock_display, mock_input):
    del mock_display  # unused
```
Does this actually do anything? Manually deleting variables is rarely needed in Python.
```python
ret1, frame1 = cap1.read()
ret2, frame2 = cap2.read()

key = chr(cv2.waitKey(1) & 0xFF)
```
Can we add a comment here for any future poor souls who have to wonder why there's a `0xFF` in here?
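For reference, a standalone illustration of why the mask matters (the integer values here are simulated, `cv2.waitKey` itself isn't called): on some platforms `waitKey` returns an integer whose bits above the low byte carry keyboard-state flags, so `& 0xFF` extracts just the key's ASCII code.

```python
# Simulated cv2.waitKey() return values: the low byte holds the key code,
# higher bits may carry platform/keyboard-state flags.
raw_clean = ord("q")               # 113: some platforms return just the code
raw_flagged = 0x100000 | ord("q")  # same key with a flag bit set

# Without the mask, chr() on the flagged value yields the wrong character;
# masking with 0xFF recovers the pressed key either way.
assert chr(raw_clean & 0xFF) == "q"
assert chr(raw_flagged & 0xFF) == "q"
assert chr(raw_flagged) != "q"
```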
Oh wow, that is ugly! I don't think I saw this when I ran it; it may be a version or OS difference. I'm confused why the learning envs are rendering -- with Pendulum I thought the policy and reward took raw observations as input, and rendering was just used for the human preference comparisons? Making it all headless is obviously good if we're capturing the video anyway and showing it in the notebook. I'd be OK merging this PR without fixing this and just making an issue to track it -- it still seems better to have the tutorial than not.
Yeah, I think the original DRLHP paper had an option for contractors to skip, which just ignored the sample. Not that principled, but it works OK in practice.
I think this is too toy to be used for real human experiments; #716 will be better for real workloads.
The windows can be avoided by using a virtual frame buffer.
Codecov Report

Attention: Patch coverage is below the project total.

```
@@            Coverage Diff             @@
##           master     #712      +/-   ##
==========================================
- Coverage   96.29%   95.82%    -0.48%
==========================================
  Files          91       91
  Lines        8728     8836     +108
==========================================
+ Hits         8405     8467      +62
- Misses        323      369      +46
```

☔ View full report in Codecov by Sentry.
Description
See #711
Testing
TODO: add notebook and experiment config that use this feature, and screenshots of behavior. (I've tested myself but not in a clean way.)