
[Question] After train, how to test our own environment? #306

Open · zhou-ting-hub opened this issue Mar 4, 2024 · 16 comments
Labels: question (Further information is requested)

Comments

@zhou-ting-hub


Questions

Thank you for your work. After successfully running the following training code:

cd examples
python train_policy.py --algo CPO --env Custom0-v0

how do I test it next?

Method 1: we modified the command to omnisafe eval ./examples/runs/CPO-{Custom0-v0}

Method 2: run the file ./examples/evaluate_saved_policy.py with

LOG_DIR = /examples/runs/CPO-{Custom0-v0}/seed-000-2024-02-29-23-33-21

So which is the right way to test the trained model, and what is the difference between Method 1 and Method 2? Is the trained model saved in examples/runs/CPO-{Custom0-v0}/seed-000-2024-02-29-23-33-21? Thank you~


@zhou-ting-hub zhou-ting-hub added the question Further information is requested label Mar 4, 2024
@zhou-ting-hub zhou-ting-hub changed the title [Question] After train, how test our own environment? [Question] After train, how to test our own environment? Mar 4, 2024
@Gaiejj
Member

Gaiejj commented Mar 5, 2024

Yes, the location where the experimental results are saved is examples/runs/CPO-{Custom0-v0}/seed-000-2024-02-29-23-33-21.
In fact, both methods of evaluating trained policies are fine. The omnisafe eval command provides more extensive command-line output, making it more suitable for beginners; examples/evaluate_saved_policy.py offers a code-level interface for evaluation, which is easier to customize, for example by using a loop (e.g. for ...) to iterate over and evaluate all results in a specific folder (see the sketch below).
If you encounter difficulties in the process of using these two methods for evaluation, feel free to continue providing feedback.
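
A minimal sketch of such a loop, assuming the Evaluator interface used in examples/evaluate_saved_policy.py (load_saved taking save_dir and model_name, with checkpoints stored under torch_save); the exact argument names and defaults may differ slightly across OmniSafe versions:

import os

import omnisafe

RUNS_DIR = './examples/runs/CPO-{Custom0-v0}'  # folder holding the seed-xxx run directories

evaluator = omnisafe.Evaluator()
for run in os.scandir(RUNS_DIR):                     # e.g. seed-000-2024-02-29-23-33-21
    if not run.is_dir():
        continue
    ckpt_dir = os.path.join(run.path, 'torch_save')  # saved models live here
    for ckpt in os.scandir(ckpt_dir):
        if ckpt.name.endswith('.pt'):
            evaluator.load_saved(save_dir=run.path, model_name=ckpt.name)
            evaluator.evaluate(num_episodes=1)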

@zhou-ting-hub
Author

zhou-ting-hub commented Mar 7, 2024

Thank you. To test, I have also tried ./examples/train_from_custom_dict.py and train_from_yaml.py; that part has been solved, but I have another three problems:

Q1: When I use Method2: run the file ./examples/evaluate_saved_policy.py

LOG_DIR = /examples/runs/CPO-{Custom0-v0}/seed-000-2024-02-29-23-33-21

(1) For the same LOG_DIR, why are the results different? In theory, evaluating the same model should give the same results.
(2) Does the evaluation use the current environment, or only the saved model in LOG_DIR?

Q2: To run ./examples/train_policy.py we want to use the GPU, and our environment has the GPU build of torch installed, but an error is raised (error screenshot omitted).

Q3: In custom_env.py, in def step():

(1) For the cost function, our goal is to satisfy the power-balance constraint self.P_d + self.P_EL + self.P_EB + self.P_ES = self.P_FC + self.P_PV + self.P_buy.

Why is self.P_error still very large after training has converged? In theory the cost should tend to 0. Or how should the cost be designed?

(2) For terminated and truncated, should stepping stop once truncated is reached?

self.iterations = 95
terminated = torch.as_tensor(self.current_step == self.iterations)
truncated = torch.as_tensor(self.current_step > 92)


@Gaiejj
Member

Gaiejj commented Mar 12, 2024

Q1
Yes, when we evaluate the trained policy, we only import the trained policy and use a randomly initialized environment. This will cause the results of each evaluation to be inconsistent. If you need the results of each evaluation to be consistent, you need to make the following change in omnisafe/evaluator.py:

SEED = 5  # for example
from omnisafe.utils.tools import seed_all
seed_all(seed=SEED)

Then in the method __load_model_and_env, after making the env by self._env = make(**env_kwargs), add:

self._env.set_seed(seed=SEED)

Please note that, to ensure the rigor of the evaluation, you should use a different random seed for evaluation than the one used during training.


Q2
Please additionally specify the GPU id, e.g. cuda:0.
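
If you launch training from Python rather than the CLI, a minimal sketch (assuming the device key lives under train_cfgs, as in the default config layout; check your installed .yaml) is:

import omnisafe

custom_cfgs = {
    'train_cfgs': {
        'device': 'cuda:0',  # full device string with GPU id, not just 'cuda'
    },
}
agent = omnisafe.Agent('CPO', 'Custom0-v0', custom_cfgs=custom_cfgs)
agent.learn()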


Q3

  • The original cost_limit is 25.00 in the algorithm's .yaml file. You can try setting it to zero to obtain the expected result.
  • Yes, the environment is stopped and reset after truncation.

@zhou-ting-hub
Author

zhou-ting-hub commented Apr 5, 2024

Thank you very much!
But we face a new problem when we run train_policy.py or train_from_custom_dict.py:
Why are the training results the same after we modified our environment, given the same random seed settings in CPO.yaml?
For example, after we modified the reward function in the environment, the training results are the same as before; after we modified the range of some variables in the environment, the training results differ only slightly in range and follow the same trend as before.
In short, the training did not further learn the modified environment but kept the original decision-making behaviour.
So does it depend on the seed? Should we vary the random seed, given that the seed is set to a constant fixed value in CPO.yaml?

@Gaiejj
Member

Gaiejj commented Apr 5, 2024

I believe this is due to your environment's random seed mechanism. The environment suite currently supported by OmniSafe is Safety-Gymnasium, which is based on Gymnasium, commonly used in the reinforcement learning community. In Gymnasium's seeding mechanism, the environment generates a series of random numbers from the initial random seed, and these are used as the seeds for subsequent resets, instead of reusing the same seed for every reset. For more details please refer to: https://github.com/Farama-Foundation/Gymnasium/blob/main/gymnasium/utils/seeding.py
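
For illustration only (plain Gymnasium, not OmniSafe code), the idea of deriving fresh sub-seeds from a single initial seed looks roughly like this:

from gymnasium.utils import seeding

np_random, seed = seeding.np_random(0)  # RNG created once from the initial seed 0
later_seeds = [int(np_random.integers(2**31)) for _ in range(3)]
print(later_seeds)  # three different values, usable as seeds for subsequent resets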

@zhou-ting-hub
Author

Thank you, but in our environment we reference the seed setting in simple-env.py as follows:

    def reset(
        self,
        seed: int | None = None,
        options: dict[str, Any] | None = None,
    ) -> tuple[torch.Tensor, dict]:
        if seed is not None:
            self.set_seed(seed)
        obs = torch.as_tensor(self._observation_space.sample())
        self._count = 0
        return obs, {}

    def set_seed(self, seed: int) -> None:
        random.seed(seed)

Also, there is seed: 0 in CPO.yaml; how does it work?
Does the if seed is not None: branch never run because seed is None in reset()?
Does random.seed(seed) have the same meaning? When we change it to random.randint(1, 10), it has no effect.
So what should we modify in the seed setting of our environment, or somewhere else, to obtain different random results?


@Gaiejj
Member

Gaiejj commented Apr 7, 2024

You can set up a simple seed logic, such as adding 10 each time. That is, when you first reset the environment you pass in seed 0, and for subsequent resets you only pass in None, letting the environment automatically reset with seeds 10, 20, 30, and so on. Similarly, when your initial seed is 5, the environment will reset in the order 15, 25, 35, and so on. This logic can easily be implemented in the reset function.

@zhou-ting-hub
Author

zhou-ting-hub commented Apr 7, 2024

There are two places related to the seed, so which one actually takes effect?

  1. In our environment, reset() and set_seed() are modified as follows:
   def __init__(self):
       self._initial_seed = 0
       self._current_seed = self._initial_seed

   def reset(
       self,
       seed: int | None = None,
       options: dict[str, Any] | None = None,
   ) -> tuple[torch.Tensor, dict]:
       if seed is not None:
           self._current_seed = seed
       else:
           self._current_seed += 10
       self.set_seed(self._current_seed)
       obs = torch.as_tensor(self._observation_space.sample())
       self._count = 0
       return obs, {}

   def set_seed(self, seed: int) -> None:
       random.seed(seed)

Following your advice we made the modification above. As a result, although the seed differs in each episode (e.g. seed = 10, 20, 30, 40, 50 for five episodes), the next time we run training the seeds are the same 10, 20, 30, 40, 50 again, so the training result is still "Episode reward: 17159.260873794556", the same as with seed=None or other values. So we think the seed setting in reset() has no effect.

The following call chain is related to set_seed(), but we have not found a solution:
(1) In omnisafe\algorithms\base_algo.py,
self._init_env()
(2) In omnisafe\algorithms\on_policy\base\policy_gradient.py,

self._env: OnPolicyAdapter = OnPolicyAdapter(
            self._env_id,
            self._cfgs.train_cfgs.vector_env_nums,
            self._seed,
            self._cfgs,
        )

(3) In omnisafe\adapter\onpolicy_adapter.py,
super().__init__(env_id, num_envs, seed, cfgs)
(4) In omnisafe\adapter\online_adapter.py,
self._env.set_seed(seed)

  2. In omnisafe\algorithms\base_algo.py, cfgs.seed is used, which is seed: 0 in CPO.yaml:
        assert hasattr(cfgs, 'seed'), 'Please specify the seed in the config file.'
        self._seed: int = int(cfgs.seed) + distributed.get_rank() * 1000
        seed_all(self._seed)

we modified the above as:

        assert hasattr(cfgs, 'seed'), 'Please specify the seed in the config file.'
        self._seed = random.randint(1, 10)
        seed_all(self._seed)

or deleted seed_all(self._seed), as follows:

        assert hasattr(cfgs, 'seed'), 'Please specify the seed in the config file.'
        self._seed: int = int(cfgs.seed) + distributed.get_rank() * 1000
     

We find that the training results can then be different. We think the second place is the one that takes effect, i.e. the seed setting in CPO.yaml is what matters. Is that right?

@Gaiejj
Member

Gaiejj commented Apr 9, 2024

I think I need to clarify the meaning of the seed mechanism:

  • For multiple episodes of environment interactions in a single training session, each episode's environment random seed should differ. As previously mentioned, a simple logic can be used to reset the environment with a different random seed each time.
  • For multiple training sessions, the same seed should yield the same results. This is for the reproducibility of experiments.
    As for your question, CPO.yaml's seed does take effect. You can modify it to run with a different initial random seed. A tiny illustration of these two properties is given below.
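
A small self-contained illustration (plain Python, independent of OmniSafe) of the two properties above:

import random

def episode_seeds(initial_seed: int, num_episodes: int) -> list:
    rng = random.Random(initial_seed)  # fixed once per training session
    return [rng.randrange(2**31) for _ in range(num_episodes)]  # differs per episode

assert episode_seeds(0, 5) == episode_seeds(0, 5)  # same initial seed -> reproducible run
assert episode_seeds(0, 5) != episode_seeds(5, 5)  # different initial seed -> different run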

@zhou-ting-hub
Author

zhou-ting-hub commented Apr 9, 2024

Thank you, I understand your meaning.

  • For multiple episodes of environment interactions in a single training session, each episode's environment random seed should differ. I think passing seed=None already satisfies that; we do not need to reset the random seed with the extra simple logic above.

  • For multiple training sessions, the same seed should yield the same results. I think that holds for the same environment, but when we modify the environment (for example, the reward function), the results are still the same: the episode reward is just scaled up proportionally, and the values of the decision variables are the same as before, without further learning based on the new reward function. This is my key problem. So I deleted seed_all(self._seed) in omnisafe\algorithms\base_algo.py, or modified CPO.yaml's seed as you suggested, to obtain different results on the next training run. But this does not seem to solve the problem at its root: although the decision-variable values are now different, they are similar to before, and the new reward function does not drive further learning.

@zhou-ting-hub
Author

zhou-ting-hub commented Apr 15, 2024

The training reward should increase and the cost should decrease.

Q1: Why is the reward on a downward trend? Our goal is to minimize the economic cost, so we set the reward to a negative value, such as reward = -(self.price_e * self.P_buy + self.price_q * self.Q_buy) * 1e-4.

Q2: Does CPO only support one constraint? Our setting is cost = torch.as_tensor(max(max(0, self.Q_buy - Max_Q_buy), 0 - self.Q_buy) + max(max(0, self.P_buy - Max_P_buy), 0 - self.P_buy)), which contains two constraints. Is this related to the decrease in reward?

Thank you for your reply!

@Gaiejj
Member

Gaiejj commented Apr 15, 2024

I'm sorry, but I'm not an expert in applying SafeRL to trading transactions. You need to focus on whether maximizing reward and minimizing cost can coexist simultaneously. For instance, in the Safety-Gymnasium supported by OmniSafe, specifically in SafetyPointGoal1-v0, maximizing reward (reaching the goal) and minimizing cost (avoiding collisions) can coexist, meaning the agent can choose a safe path to the goal. If the environment is designed to meet this condition, then it might be because the default parameters of CPO are not well suited to your task, and you can use examples/benchmarks/run_experiment_grid.py to search for the optimal hyperparameters.

OmniSafe's CPO currently does not support multiple constraints. You can try to handle this by summing the two cost functions or taking their average, depending on their actual meanings; a small sketch is given below.
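
A minimal sketch of folding the two box constraints from your step() into one scalar cost, reusing your names Max_P_buy and Max_Q_buy (this is equivalent to summing the individual violations):

import torch

def combined_cost(p_buy: float, q_buy: float, max_p_buy: float, max_q_buy: float) -> torch.Tensor:
    # Violation of 0 <= P_buy <= Max_P_buy plus violation of 0 <= Q_buy <= Max_Q_buy.
    p_violation = max(max(0.0, p_buy - max_p_buy), 0.0 - p_buy)
    q_violation = max(max(0.0, q_buy - max_q_buy), 0.0 - q_buy)
    return torch.as_tensor(p_violation + q_violation)  # single scalar cost per step, as CPO expects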

@zhou-ting-hub
Author

zhou-ting-hub commented Apr 26, 2024

Some problems about the reward and cost learning curves:

We run train_from_custom_dict.py with the number of epochs set to 4000; the results of agent.plot() and agent.render() are saved in D:\3omnisafe-main\examples\runs\CPO-{Custom0-v0}\seed-001-2024-04-26-17-29-33.


Q1: In def plot() of C:\Users\zyt\.conda\envs\omnisafegpu\Lib\site-packages\omnisafe\algo_wrapper.py, the plot result zyt.png is saved in D:\3omnisafe-main\examples\runs\CPO-{Custom0-v0}\seed-001-2024-04-26-17-29-33\zyt.png.

[zyt.png screenshots omitted]

But when we run tensorboard --logdir D:\3omnisafe-main\examples\runs\CPO-{Custom0-v0}\seed-001-2024-04-26-17-29-33\tb, the TensorBoard results are as follows:

[TensorBoard screenshots omitted]

We find that the convergence curves in TensorBoard are the same as those in progress.csv, which reflects the training process:

[progress.csv screenshots omitted]

In conclusion, the convergence curves in TensorBoard are the same as those in progress.csv, which reflects the training process, but different from zyt.png above. So what does the zyt.png obtained by agent.plot() reflect, if not the convergence curve?

Q2: In def render() of C:\Users\zyt\.conda\envs\omnisafegpu\Lib\site-packages\omnisafe\evaluator.py, the render results are saved in D:\3omnisafe-main\examples\runs\CPO-{Custom0-v0}\seed-001-2024-04-26-17-29-33\video.


Because we deleted the mp4 output in save_video.py in Lib\site-packages\gymnasium\utils due to an error, we only obtain result.txt in video\epoch1000, epoch2000, epoch3000 and epoch4000. What does the mp4 refer to? Is it necessary?


At the same time, we obtain 'myplot_multitimegpu.csv', which is the result of def render() in our environment.


In conclusion, we set 'save_model_freq': 1000 in CPO.yaml, so the trained models are saved in torch_save\epoch1000 ... epoch4000.pt and the corresponding render results are saved in video\epoch1000 ... epoch4000\result.txt. We find that the render results in video\epoch4000\result.txt are terrible. Why has the training curve in TensorBoard (or in progress.csv) converged, while the reward and cost values in video\epoch4000\result.txt are much worse than the training curve?

**For example, the cost in the training curve in TensorBoard (or in progress.csv) has converged to 60, but the cost in the rendered result.txt is as large as 17870 (correspondingly, the energy-power scheduling decisions in myplot_multitimegpu.csv exceed the limits set in the cost function, and the sum of the violations in one epoch reaches 17870). The reward in the training curve has converged to -432, but the reward in the rendered result.txt is as low as -648.**

But when we modify it to 100, the render results are saved in video\epoch100, epoch200, ..., epoch3900, epoch4000\result.txt, and the render results in video\epoch4000\result.txt are still terrible, so it does not seem to be related to 'save_model_freq'.

The core problem is that agent.learn() in train_from_custom_dict.py trains very well: the cost decreases and converges to 0, and the reward rises and converges nicely. But when we run agent.render() in train_from_custom_dict.py, the results obtained in video\epoch4000\result.txt and myplot_multitimegpu.csv are both very poor; they do not match the training convergence curves, and the decision variables clearly violate the cost constraints by a large margin.

Only when we use the method you taught us before, adding self.render() inside the environment's step() so that myplot_multitimegpu.csv is generated during training, do the rendered results match the convergence curves shown during training: the cost reaches 0 and the decision variables stay within the constraints. The results are good, but this is extremely slow.

agent.render() simply loads the model saved to torch_save\epoch4000.pt during agent.learn(), so why does the agent train well but render poorly? So: (1) does agent.render() count as testing, or only as a rendering of the training result? (2) If it is testing, the environment data has not changed at all, so the results are far too bad. (3) If it is only a rendering of the training result, then how do we test? With agent.evaluate()? But agent.evaluate() also only produces a result.txt similar to agent.render(); it does not invoke the render() in our environment and cannot generate myplot_multitimegpu.csv.

Sorry for all the trouble, and many thanks~


@zhou-ting-hub
Author

Looking forward to your reply about the above problems, thank you~

@Gaiejj
Member

Gaiejj commented May 10, 2024

I apologize for the late reply. I will address your questions one by one:
(1) The curve data in zyt.png is consistent with that in TensorBoard. The discrepancy in visual presentation is because TensorBoard automatically ignores excessively high outlier values. For instance, in the TensorBoard chart you showed, the EpCost scale is around 60-180, whereas in zyt.png the EpCost scale is 0-14000. You only need to use the following code to set the display range of the axes, around line 160 of omnisafe/utils/plotter.py:

sub_figures[1].set_ylim(COST_LOWER, COST_UPPER)
sub_figures[0].set_ylim(REWARD_LOWER, REWARD_UPPER)

(2)
I noticed that your main concern is why the agent's performance during render() is inconsistent with training. Addressing your three questions, here are my explanations:

a. The original design intention of render() is to visualize evaluation results, so it serves as both evaluation and visualization. Evaluation differs from training in two respects: 1. the agent generates actions using a deterministic policy, not a stochastic one; 2. the random seed of the agent's evaluation environment is different from that of the training environment.

b. If your evaluation results are very inconsistent with training, you might consider changing the deterministic strategy to a stochastic strategy, like:

act = self._actor.predict(
    obs.reshape(
        -1,
        obs.shape[-1],  # to make sure the shape is (1, obs_dim)
    ),
    deterministic=False,
).reshape(
    -1,  # to make sure the shape is (act_dim,)
)

or carefully check whether the environment imported during render() is consistent with the training environment.

@zhou-ting-hub
Author

zhou-ting-hub commented May 14, 2024

Thank you. For (2)b, we modified def render() and def evaluate() in evaluator.py accordingly, but when running evaluator.render(num_episodes=1) and evaluator.evaluate(num_episodes=1) in evaluate_saved_policy.py, the results are still very inconsistent with training.


And for (2)a, you said "The random seed in the agent's evaluation environment is different from that in the training environment."
We set seed=0 in evaluator.py, the same as the seed: 0 in CPO.yaml used during training, but the results of evaluator.render(num_episodes=1) and evaluator.evaluate(num_episodes=1) in evaluate_saved_policy.py are both still terrible.

