
Question regarding the reward of sales promotion training dataset #10

britisony opened this issue Feb 22, 2024 · 3 comments

@britisony commented Feb 22, 2024
Hi,

In the sales promotion environment, the reward is computed as rew = (d_total_gmv - d_total_cost) / self.num_users, which means the operator observes a single reward signal aggregated over all users. However, in the offline training dataset the reward differs for each user across the 50 days. For example, refer to the user orders and reward graphs below:
[Screenshots: per-user order counts and per-user rewards over the 50 days]

As per my understanding, the reward should be the same for all three users on each day and should gradually increase over the 50 days as sales increase. Could you kindly let me know how the reward in the training dataset was calculated?

@mzktbyjc2016 (Contributor) commented:

Hi, the calculation of each user's reward is the same, and it is based on $gmv - cost$. However, since the platform policy should consider the overall/average income, the reward for each user is set to the average reward for simplicity.

Additionally, the dataset's actions were made by a human operator (after data anonymization), and we retain the original reward in the dataset for researchers with specific needs.
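A minimal sketch (not repo code, with made-up per-user numbers) contrasting the per-user gmv - cost values kept in the dataset with the averaged reward the environment returns:

```python
import numpy as np

# Hypothetical per-user quantities for a single day (three users).
user_gmv = np.array([120.0, 80.0, 40.0])
user_cost = np.array([30.0, 20.0, 10.0])

# Original per-user reward retained in the dataset: differs across users.
per_user_reward = user_gmv - user_cost

# Environment-style reward: a single platform-level average shared by all users.
avg_reward = (user_gmv.sum() - user_cost.sum()) / len(user_gmv)

print(per_user_reward)  # [90. 60. 30.]
print(avg_reward)       # 60.0
```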

@britisony (Author) commented Feb 27, 2024

Thank you for your reply. Could you kindly let me know how the original reward was calculated? I need it to recalculate the reward based on variations in user orders.

Also, in the provided environment I noticed that val_initial_states = np.load(os.path.join(dir, f'test_initial_states_10000_people.npy')) is never restored on reset, which causes self.states to take different values every time the environment is reset after initialization. As a result, even with a deterministic action the reward grows every time the environment is reset. For example, please refer to the code and reward graph below:
[Screenshot: reset loop code and plot of episode rewards growing across resets]

Could you let me know if this is a bug or if there is a reason behind this design choice?
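For reference, a minimal, repo-independent sketch of the aliasing behavior described above (all names are illustrative, not the repo's code): when reset assigns the stored initial states by reference, in-place updates during an episode leak back into them, so every reset starts from a drifted state.

```python
import numpy as np

class ToyEnv:
    """Stand-in for sp_env; only the aliasing behavior is reproduced."""
    def __init__(self):
        # Stands in for the loaded test_initial_states_10000_people.npy array.
        self.val_initial_states = np.zeros(3)

    def reset(self):
        self.states = self.val_initial_states  # reference, not a copy
        return self.states

    def step(self):
        self.states += 1.0  # in-place update also mutates val_initial_states

env = ToyEnv()
for episode in range(3):
    print("episode", episode, "initial state:", env.reset())
    env.step()
# episode 0 initial state: [0. 0. 0.]
# episode 1 initial state: [1. 1. 1.]  <- drifted because of the shared reference
# episode 2 initial state: [2. 2. 2.]
```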

@mzktbyjc2016 (Contributor) commented:

Thanks for reporting this issue. This environment was originally designed for online evaluation, so some of the code is tailored to evaluation rather than training. We have fixed this reset issue locally for training, but that branch has not been committed yet. It will come soon, together with the newer sales promotion environment with a budget constraint.

As a quick fix, you can revise this line to use deepcopy(), i.e., replace

self.states = self.val_initial_states

with "self.states = deepcopy(self.val_initial_states)".
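A minimal sketch of that quick fix, assuming the quoted assignment sits inside the environment's reset() (for a NumPy array, .copy() works equally well); note the extra import:

```python
from copy import deepcopy
import numpy as np

class PatchedEnv:
    def __init__(self, initial_states: np.ndarray):
        self.val_initial_states = initial_states

    def reset(self):
        # was: self.states = self.val_initial_states
        self.states = deepcopy(self.val_initial_states)  # each reset now starts from the same states
        return self.states
```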

For "I require it to recalculate the reward based on variations in user orders", as mentioned above, the current sp_env does not support this calculation. You may need to use the raw order_number (the user network output) and gmv&cost data in

user_action = batch_user_action[index]

and
per_cost = (1 - avg_discount_ratio) * day_coupon_used_num * day_avg_fee
respectively.
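A hedged sketch (not repo code) of recombining those pieces into a per-user, per-day reward: the cost term follows the quoted formula, while the gmv term is left as an input because the thread does not give its exact expression and it must be derived from the raw order/fee data.

```python
def compute_per_cost(avg_discount_ratio, day_coupon_used_num, day_avg_fee):
    # Platform's subsidy cost for the coupons the user actually used (the quoted formula).
    return (1 - avg_discount_ratio) * day_coupon_used_num * day_avg_fee

def per_user_reward(per_gmv, per_cost):
    # Per-user daily reward as described earlier in the thread: gmv minus cost.
    return per_gmv - per_cost

# Hypothetical numbers: gmv of 45.0, 2 coupons used, average fee 10, discount ratio 0.95.
cost = compute_per_cost(avg_discount_ratio=0.95, day_coupon_used_num=2, day_avg_fee=10.0)
print(per_user_reward(per_gmv=45.0, per_cost=cost))  # 45.0 - 1.0 = 44.0
```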
