
Question regarding the reward of sales promotion training dataset #10

britisony opened this issue Feb 22, 2024 · 3 comments

@britisony commented Feb 22, 2024
Hi,

In the sales promotion environment, the reward is computed as rew = (d_total_gmv - d_total_cost) / self.num_users, which means the operator observes a single reward signal aggregated over all users. However, in the offline training dataset the reward differs for each user across the 50 days. For example, refer to the user orders and reward graphs below:
[Screenshots: per-user order counts and per-user rewards over the 50 days]

As per my understanding, the reward should be the same for all three users on each day and should gradually increase over the 50 days as sales increase. Could you kindly let me know how the reward in the training dataset was calculated?

@mzktbyjc2016 (Contributor) commented:

Hi, the calculation of each user's reward is the same, and it is based on $gmv - cost$. However, since the platform policy should consider the overall/average income, the reward for each user is set to the average reward for simplicity.

Additionally, the dataset's actions were made by a human operator (after data anonymization), and we retain the original reward in the dataset for researchers with specific needs.
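A minimal sketch (not repo code, with made-up per-user numbers) contrasting the per-user gmv - cost values kept in the dataset with the averaged reward the environment returns:

```python
import numpy as np

# Hypothetical per-user quantities for a single day (three users).
user_gmv = np.array([120.0, 80.0, 40.0])
user_cost = np.array([30.0, 20.0, 10.0])

# Original per-user reward retained in the dataset: differs across users.
per_user_reward = user_gmv - user_cost

# Environment-style reward: a single platform-level average shared by all users.
avg_reward = (user_gmv.sum() - user_cost.sum()) / len(user_gmv)

print(per_user_reward)  # [90. 60. 30.]
print(avg_reward)       # 60.0
```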

@britisony (Author) commented Feb 27, 2024

Thank you for your reply. Could you kindly let me know how the original reward was calculated? I need it to recalculate the reward based on variations in user orders.

Also, in the provided environment I noticed that val_initial_states = np.load(os.path.join(dir, f'test_initial_states_10000_people.npy')) is never restored on reset, which causes self.states to take different values every time the environment is reset after initialization. As a result, even with a deterministic action the reward grows every time the environment is reset. For example, please refer to the code and reward graph below:
[Screenshot: reset loop code and plot of episode rewards growing across resets]

Could you let me know if this is a bug or if there is a reason behind this design choice?
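For reference, a minimal, repo-independent sketch of the aliasing behavior described above (all names are illustrative, not the repo's code): when reset assigns the stored initial states by reference, in-place updates during an episode leak back into them, so every reset starts from a drifted state.

```python
import numpy as np

class ToyEnv:
    """Stand-in for sp_env; only the aliasing behavior is reproduced."""
    def __init__(self):
        # Stands in for the loaded test_initial_states_10000_people.npy array.
        self.val_initial_states = np.zeros(3)

    def reset(self):
        self.states = self.val_initial_states  # reference, not a copy
        return self.states

    def step(self):
        self.states += 1.0  # in-place update also mutates val_initial_states

env = ToyEnv()
for episode in range(3):
    print("episode", episode, "initial state:", env.reset())
    env.step()
# episode 0 initial state: [0. 0. 0.]
# episode 1 initial state: [1. 1. 1.]  <- drifted because of the shared reference
# episode 2 initial state: [2. 2. 2.]
```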

@mzktbyjc2016 (Contributor) commented:

Thanks for reporting this issue. This environment was originally designed for online evaluation, so some of the code is tailored to evaluation rather than training. We have fixed this reset issue locally for training, but that branch has not been committed yet. It will come soon, together with the newer sales promotion environment with a budget constraint.

As a quick fix, you can revise this line to use deepcopy(), i.e., replace

self.states = self.val_initial_states

with "self.states = deepcopy(self.val_initial_states)".
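A minimal sketch of that quick fix, assuming the quoted assignment sits inside the environment's reset() (for a NumPy array, .copy() works equally well); note the extra import:

```python
from copy import deepcopy
import numpy as np

class PatchedEnv:
    def __init__(self, initial_states: np.ndarray):
        self.val_initial_states = initial_states

    def reset(self):
        # was: self.states = self.val_initial_states
        self.states = deepcopy(self.val_initial_states)  # each reset now starts from the same states
        return self.states
```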

For "I require it to recalculate the reward based on variations in user orders", as mentioned above, the current sp_env does not support this calculation. You may need to use the raw order_number (the user network output) and gmv&cost data in

user_action = batch_user_action[index]

and
per_cost = (1 - avg_discount_ratio) * day_coupon_used_num * day_avg_fee
respectively.
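A hedged sketch (not repo code) of recombining those pieces into a per-user, per-day reward: the cost term follows the quoted formula, while the gmv term is left as an input because the thread does not give its exact expression and it must be derived from the raw order/fee data.

```python
def compute_per_cost(avg_discount_ratio, day_coupon_used_num, day_avg_fee):
    # Platform's subsidy cost for the coupons the user actually used (the quoted formula).
    return (1 - avg_discount_ratio) * day_coupon_used_num * day_avg_fee

def per_user_reward(per_gmv, per_cost):
    # Per-user daily reward as described earlier in the thread: gmv minus cost.
    return per_gmv - per_cost

# Hypothetical numbers: gmv of 45.0, 2 coupons used, average fee 10, discount ratio 0.95.
cost = compute_per_cost(avg_discount_ratio=0.95, day_coupon_used_num=2, day_avg_fee=10.0)
print(per_user_reward(per_gmv=45.0, per_cost=cost))  # 45.0 - 1.0 = 44.0
```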
