In this appendix we provide:
- the average runtime, in minutes, of the 100 hypertuning trials by algorithm and dataset;
- the detailed search space used for hyperparameter tuning and the best hyperparameters found for each experiment group (a combination of algorithm, training approach and dataset).
- Average runtime (in minutes) of the 100 hypertuning trials by algorithm and dataset
- Hypertuning Search Space
- Best Hyperparameters per Algorithm
- Baselines
- Transformers with only item id feature
- XLNet MLM with side information features
Group | Algorithm | REES46 eCommerce Avg. | REES46 eCommerce Std. Dev. | YOOCHOOSE eCommerce Avg. | YOOCHOOSE eCommerce Std. Dev. | G1 news Avg. | G1 news Std. Dev. | ADRESSA news Avg. | ADRESSA news Std. Dev.
---|---|---|---|---|---|---|---|---|---
Number of sliding windows | | 15 days | | 90 days | | 190 hours (~8 days) | | 190 hours (~8 days) |
Baselines | V-SkNN | 191.4 | 15.5 | 316.3 | 107.5 | 282.6 | 88.2 | 163.0 | 95.2
Baselines | STAN | 221.3 | 30.0 | 378.4 | 55.6 | 101.2 | 12.9 | 92.6 | 13.2
Baselines | VSTAN | 293.3 | 59.0 | 412.2 | 55.9 | 128.7 | 20.1 | 105.8 | 17.1
Baselines | GRU4Rec (FT) | 163.0 | 17.5 | 756.3 | 258.3 | 173.1 | 31.4 | 145.1 | 31.6
Baselines | GRU4Rec (SWT) | 146.2 | 15.8 | 497.0 | 157.1 | 138.6 | 30.8 | 101.2 | 12.5
Baselines | GRU | 148.3 | 25.7 | 122.3 | 35.8 | 63.5 | 10.4 | 51.8 | 8.5
Transformers with only the item id feature | GPT-2 (CLM) | 133.7 | 20.7 | 94.4 | 26.6 | 47.5 | 9.2 | 54.8 | 9.8
Transformers with only the item id feature | Transformer-XL (CLM) | 108.9 | 28.0 | 125.1 | 37.9 | 56.7 | 11.7 | 69.4 | 15.5
Transformers with only the item id feature | ALBERT (MLM) | 116.8 | 36.0 | 116.1 | 33.8 | 67.1 | 16.9 | 59.2 | 17.6
Transformers with only the item id feature | Electra (RTD) | 109.9 | 28.3 | 125.1 | 46.5 | 88.2 | 18.5 | 62.1 | 19.2
Transformers with only the item id feature | XLNet (PLM) | 430.6 | 50.0 | 756.3 | 91.8 | 197.3 | 35.5 | 205.1 | 36.8
Transformers with only the item id feature | XLNet (CLM) | 137.9 | 24.0 | 139.8 | 57.8 | 54.4 | 11.9 | 62.1 | 11.2
Transformers with only the item id feature | XLNet (RTD) | 188.4 | 52.3 | 257.7 | 106.2 | 74.6 | 21.6 | 69.7 | 17.3
Transformers with only the item id feature | XLNet (MLM) | 104.8 | 40.5 | 120.8 | 42.2 | 63.5 | 17.2 | 63.3 | 16.2
Transformers with side information features | Concatenation merge | 142.7 | 42.8 | - | - | 66.9 | 17.7 | 70.1 | 20.7
Transformers with side information features | Concatenation merge with numericals using Soft One-Hot Encoding | 173.9 | 46.5 | - | - | 69.0 | 21.4 | 80.8 | 19.8
Transformers with side information features | Element-wise merge | 127.7 | 26.2 | - | - | 66.8 | 15.4 | 56.3 | 16.4

Notes:
- Each hypertuning trial runs the full incremental training and evaluation pipeline over the sliding windows of each dataset; the number of windows is given in the "Number of sliding windows" row of the table.
- All experiments ran on a machine instance with 8 CPU cores, 50 GB of RAM and one V100 GPU with 32 GB of memory.
- The training implementation of V-SkNN, STAN and VSTAN baselines is CPU-based; all other algorithms were trained on GPU. The evaluation of the Session k-NN methods and GRU4Rec was performed using CPU multi-processing, and all other algorithms were evaluated using GPU.
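
To put the averages in perspective, the tuning cost of one experiment group is roughly the number of trials times the average runtime per trial. The sketch below does that arithmetic for a few rows of the runtime table. It assumes the trials are counted as if they ran one after another on a single machine, which the notes above do not state; the function name `total_hours` is ours.

```python
# Rough tuning budget per experiment group: trials x average runtime per trial.
# Assumption: trials are counted as if run sequentially on one machine.
N_TRIALS = 100  # number of hypertuning trials reported in this appendix

# Average runtime per trial in minutes, copied from the runtime table.
avg_minutes = {
    ("XLNet (MLM)", "REES46"): 104.8,
    ("XLNet (PLM)", "REES46"): 430.6,
    ("GRU4Rec (FT)", "YOOCHOOSE"): 756.3,
}


def total_hours(avg_runtime_minutes: float, n_trials: int = N_TRIALS) -> float:
    """Total hours spent on all trials of one experiment group."""
    return avg_runtime_minutes * n_trials / 60.0


for (algorithm, dataset), minutes in avg_minutes.items():
    print(f"{algorithm} on {dataset}: {total_hours(minutes):.1f} hours")
```

Under that assumption, the slowest groups (for example GRU4Rec (FT) on YOOCHOOSE) take more than a thousand hours for the full 100 trials.
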
Table 2. Algorithms using the Transformers4Rec Meta-Architecture - Transformers and GRU baseline - using only the item id feature
Experiment Group | Type | Hyperparameter Name | Search Space | Sampling Distribution
---|---|---|---|---
Common parameters | fixed | inp_merge | mlp | -
Common parameters | fixed | input_features_aggregation | concat | -
Common parameters | fixed | loss_type | cross_entropy | -
Common parameters | fixed | model_type | gpt2, transfoxl, xlnet, albert, electra, gru (baseline) | -
Common parameters | fixed | mf_constrained_embeddings | True | -
Common parameters | fixed | per_device_eval_batch_size | 512 | -
Common parameters | fixed | tf_out_activation | tanh | -
Common parameters | fixed | similarity_type | concat_mlp | -
Common parameters | fixed | dataloader_drop_last | False | -
Common parameters | fixed | dataloader_drop_last (for large ecommerce) | True | -
Common parameters | fixed | compute_metrics_each_n_steps | 1 | -
Common parameters | fixed | eval_on_last_item_seq_only | True | -
Common parameters | fixed | learning_rate_schedule | linear_with_warmup | -
Common parameters | fixed | learning_rate_warmup_steps | 0 | -
Common parameters | fixed | layer_norm_all_features | False | -
Common parameters | fixed | layer_norm_featurewise | True | -
Common parameters | fixed | num_train_epochs | 10 | -
Common parameters | fixed | session_seq_length_max | 20 | -
Common parameters | hypertuning | d_model | [64, 448] | int_uniform (step 64)
Common parameters | hypertuning | item_embedding_dim | [64, 448] | int_uniform (step 64)
Common parameters | hypertuning | n_layer | [1, 4] | int_uniform
Common parameters | hypertuning | n_head | [1, 2, 4, 8, 16] | categorical
Common parameters | hypertuning | input_dropout | [0, 0.5] | discrete_uniform (step 0.1)
Common parameters | hypertuning | dropout | [0, 0.5] | discrete_uniform (step 0.1)
Common parameters | hypertuning | learning_rate | [0.0001, 0.01] | log_uniform
Common parameters | hypertuning | weight_decay | [0.000001, 0.001] | log_uniform
Common parameters | hypertuning | per_device_train_batch_size | [128, 512] | int_uniform (step 64)
Common parameters | hypertuning | label_smoothing | [0, 0.9] | discrete_uniform (step 0.1)
Common parameters | hypertuning | item_id_embeddings_init_std | [0.01, 0.15] | discrete_uniform (step 0.02)
GRU | hypertuning | stochastic_shared_embeddings_replacement_prob | [0.0, 0.1] | discrete_uniform (step 0.02)
GPT2 | hypertuning | stochastic_shared_embeddings_replacement_prob | [0.0, 0.1] | discrete_uniform (step 0.02)
TransformerXL | hypertuning | stochastic_shared_embeddings_replacement_prob | [0.0, 0.1] | discrete_uniform (step 0.02)
XLNet-CausalLM | fixed | attn_type | uni | -
XLNet-CausalLM | hypertuning | stochastic_shared_embeddings_replacement_prob | [0.0, 0.1] | discrete_uniform (step 0.02)
XLNet-MLM | fixed | mlm | True | -
XLNet-MLM | fixed | attn_type | bi | -
XLNet-MLM | hypertuning | stochastic_shared_embeddings_replacement_prob | [0.0, 0.1] | discrete_uniform (step 0.02)
XLNet-MLM | hypertuning | mlm_probability | [0, 0.7] | discrete_uniform (step 0.1)
XLNet-PLM | fixed | plm | True | -
XLNet-PLM | fixed | attn_type | bi | -
XLNet-PLM | fixed | plm_mask_input | False | -
XLNet-PLM | hypertuning | plm_probability (for ecommerce datasets) | [0, 0.7] | discrete_uniform (step 0.1)
XLNet-PLM | hypertuning | plm_max_span_length (for ecommerce datasets) | [2, 6] | int_uniform
XLNet-PLM | hypertuning | plm_probability (for news datasets) | [0.4, 0.8] | discrete_uniform (step 0.1)
XLNet-PLM | hypertuning | plm_max_span_length (for news datasets) | [1, 4] | int_uniform
Electra-RTD | fixed | rtd | True | -
Electra-RTD | fixed | mlm | True | -
Electra-RTD | fixed | rtd_tied_generator | True | -
Electra-RTD | fixed | rtd_use_batch_interaction | False | -
Electra-RTD | fixed | rtd_sample_from_batch | True | -
Electra-RTD | hypertuning | rtd_discriminator_loss_weight | [1, 10, 20, 30, 40, 50] | categorical
Electra-RTD | hypertuning | mlm_probability | [0, 0.7] | discrete_uniform (step 0.1)
ALBERT* | fixed | mlm | True | -
ALBERT* | fixed | inner_group_num | 1 | -
ALBERT* | fixed | num_hidden_groups | -1 | -
ALBERT* | hypertuning | stochastic_shared_embeddings_replacement_prob | [0.0, 0.1] | discrete_uniform (step 0.02)
ALBERT* | hypertuning | mlm_probability | [0, 0.7] | discrete_uniform (step 0.1)

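
The sampling distributions named in Table 2 can be illustrated with a short, library-free sketch. It draws one random configuration of the common `hypertuning` rows using only Python's standard library. This is plain random sampling and makes no claim about the search algorithm or tuning library actually used, which this appendix does not state; the helper names are ours.

```python
import math
import random


def int_uniform(rng, low, high, step=1):
    """Integer drawn uniformly from {low, low + step, ..., high}."""
    return low + step * rng.randrange((high - low) // step + 1)


def discrete_uniform(rng, low, high, step):
    """Float drawn uniformly from a grid of spacing `step` over [low, high]."""
    n_steps = round((high - low) / step)
    return round(low + step * rng.randint(0, n_steps), 10)


def log_uniform(rng, low, high):
    """Float whose logarithm is uniform on [log(low), log(high)]."""
    value = math.exp(rng.uniform(math.log(low), math.log(high)))
    return min(max(value, low), high)  # guard against floating-point overshoot


def sample_common_hyperparameters(rng):
    """One random configuration of the common 'hypertuning' rows of Table 2."""
    return {
        "d_model": int_uniform(rng, 64, 448, step=64),
        "item_embedding_dim": int_uniform(rng, 64, 448, step=64),
        "n_layer": int_uniform(rng, 1, 4),
        "n_head": rng.choice([1, 2, 4, 8, 16]),  # categorical
        "input_dropout": discrete_uniform(rng, 0.0, 0.5, 0.1),
        # Assumed name: reported as `dropout` in the best-hyperparameter tables.
        "dropout": discrete_uniform(rng, 0.0, 0.5, 0.1),
        "learning_rate": log_uniform(rng, 1e-4, 1e-2),
        "weight_decay": log_uniform(rng, 1e-6, 1e-3),
        "per_device_train_batch_size": int_uniform(rng, 128, 512, step=64),
        "label_smoothing": discrete_uniform(rng, 0.0, 0.9, 0.1),
        "item_id_embeddings_init_std": discrete_uniform(rng, 0.01, 0.15, 0.02),
    }


rng = random.Random(0)  # fixed seed so the sketch is reproducible
trial = sample_common_hyperparameters(rng)
print(trial)
```

Note that `log_uniform` spreads samples evenly across orders of magnitude, which is why it is the natural choice for `learning_rate` and `weight_decay`, whose ranges span two and three decades respectively.
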
Experiment Group | Type | Hyperparameter Name | Search Space | Sampling Distribution
---|---|---|---|---
Common hyperparameters | fixed | layer_norm_all_features | False | -
Common hyperparameters | fixed | layer_norm_featurewise | True | -
Common hyperparameters | hypertuning | other_embeddings_init_std | [0.005, 0.10] | discrete_uniform (step 0.005)
Common hyperparameters | hypertuning | embedding_dim_from_cardinality_multiplier | [1.0, 10.0] | discrete_uniform (step 1.0)
Concatenation merge - Numerical features as scalars | fixed | input_features_aggregation | concat | -
Concatenation merge - Numerical features with Soft One-Hot Encoding | fixed | input_features_aggregation | concat | -
Concatenation merge - Numerical features with Soft One-Hot Encoding | hypertuning | numeric_features_project_to_embedding_dim | [5, 55] | discrete_uniform (step 10)
Concatenation merge - Numerical features with Soft One-Hot Encoding | hypertuning | numeric_features_soft_one_hot_encoding_num_embeddings | [5, 55] | discrete_uniform (step 10)
Element-wise merge | fixed | input_features_aggregation | elementwise_sum_multiply_item_embedding | -

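
For readers unfamiliar with Soft One-Hot Encoding, the following is a minimal sketch of the general technique: a numerical scalar is projected to one logit per embedding, the logits are normalized with a softmax, and the feature is represented as the resulting weighted average of the embedding rows. We assume here that `numeric_features_soft_one_hot_encoding_num_embeddings` sets the number of rows of the embedding table and `numeric_features_project_to_embedding_dim` sets its width; the function name and the random initialization are illustrative only and are not taken from the Transformers4Rec code.

```python
import math
import random


def soft_one_hot_encode(x, proj_weight, proj_bias, embedding_table):
    """Represent scalar x as a softmax-weighted average of embedding rows."""
    # One logit per embedding, from a linear projection of the scalar.
    logits = [w * x + b for w, b in zip(proj_weight, proj_bias)]
    # Numerically stable softmax over the logits.
    max_logit = max(logits)
    exps = [math.exp(logit - max_logit) for logit in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted average of the embedding rows, one output value per dimension.
    dim = len(embedding_table[0])
    return [
        sum(weight * row[d] for weight, row in zip(weights, embedding_table))
        for d in range(dim)
    ]


# Sizes taken from the best REES46 configuration reported in this appendix.
num_embeddings = 5   # numeric_features_soft_one_hot_encoding_num_embeddings
embedding_dim = 20   # numeric_features_project_to_embedding_dim

rng = random.Random(0)  # illustrative random initialization
proj_weight = [rng.gauss(0.0, 1.0) for _ in range(num_embeddings)]
proj_bias = [rng.gauss(0.0, 1.0) for _ in range(num_embeddings)]
embedding_table = [
    [rng.gauss(0.0, 0.1) for _ in range(embedding_dim)]
    for _ in range(num_embeddings)
]

vector = soft_one_hot_encode(0.37, proj_weight, proj_bias, embedding_table)
print(len(vector))
```

In a trained model the projection and the embedding table would be learned jointly with the rest of the network rather than drawn at random.
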
Experiment Group | Type | Hyperparameter Name | Search Space | Sampling Distribution
---|---|---|---|---
Common parameters | fixed | model_type | gru4rec, vsknn, stan, vstan | -
Common parameters | fixed | eval_on_last_item_seq_only | True | -
Common parameters | fixed | session_seq_length_max | 20 | -
GRU4Rec | fixed | gru4rec-n_epochs | 10 | -
GRU4Rec | fixed | no_incremental_training | True | -
GRU4Rec | fixed | training_time_window_size (full-train) | 0 | -
GRU4Rec | fixed | training_time_window_size (sliding 20%) | 20% of the length of the dataset | -
GRU4Rec | hypertuning | gru4rec-batch_size | [128, 512] | int_uniform (step 64)
GRU4Rec | hypertuning | gru4rec-learning_rate | [0.0001, 0.1] | log_uniform
GRU4Rec | hypertuning | gru4rec-dropout_p_hidden | [0, 0.5] | discrete_uniform (step 0.1)
GRU4Rec | hypertuning | gru4rec-layers | [64, 448] | int_uniform (step 64)
GRU4Rec | hypertuning | gru4rec-embedding | [0, 448] | int_uniform (step 64)
GRU4Rec | hypertuning | gru4rec-constrained_embedding | [True, False] | categorical
GRU4Rec | hypertuning | gru4rec-momentum | [0, 0.5] | float_uniform (step 0.01)
GRU4Rec | hypertuning | gru4rec-final_act | [elu-0.5, linear, tanh] | categorical
GRU4Rec | hypertuning | gru4rec-loss | [bpr-max, top1-max] | categorical
V-SkNN | fixed | eval_baseline_cpu_parallel | True | -
V-SkNN | fixed | workers_count | 2 | -
V-SkNN | hypertuning | vsknn-k | [50, 1500] | int_uniform (step 50)
V-SkNN | hypertuning | vsknn-sample_size | [500, 10000] | int_uniform (step 500)
V-SkNN | hypertuning | vsknn-weighting | [same, div, linear, quadratic, log] | categorical
V-SkNN | hypertuning | vsknn-weighting_score | [same, div, linear, quadratic, log] | categorical
V-SkNN | hypertuning | vsknn-idf_weighting | [1, 2, 5, 10] | categorical
V-SkNN | hypertuning | vsknn-remind | [True, False] | categorical
V-SkNN | hypertuning | vsknn-push_reminders | [True, False] | categorical
STAN | fixed | eval_baseline_cpu_parallel | True | -
STAN | fixed | workers_count | 2 | -
STAN | hypertuning | stan-k | [50, 2000] | int_uniform (step 50)
STAN | hypertuning | stan-sample_size | [500, 10000] | int_uniform (step 500)
STAN | hypertuning | stan-lambda_spw | [0.00001, L/8, L/4, L/2, L, L*2]* | categorical
STAN | hypertuning | stan-lambda_snh | [2.5, 5, 10, 20, 40, 80, 100] | categorical
STAN | hypertuning | stan-lambda_inh | [0.00001, L/8, L/4, L/2, L, L*2]* | categorical
STAN | hypertuning | stan-remind | [True, False] | categorical
VSTAN | fixed | eval_baseline_cpu_parallel | True | -
VSTAN | fixed | workers_count | 2 | -
VSTAN | hypertuning | vstan-k | [50, 2000] | int_uniform (step 50)
VSTAN | hypertuning | vstan-sample_size | [500, 10000] | int_uniform (step 500)
VSTAN | hypertuning | vstan-lambda_spw | [0.00001, L/8, L/4, L/2, L, L*2]* | categorical
VSTAN | hypertuning | vstan-lambda_snh | [2.5, 5, 10, 20, 40, 80, 100] | categorical
VSTAN | hypertuning | vstan-lambda_inh | [0.00001, L/8, L/4, L/2, L, L*2]* | categorical
VSTAN | hypertuning | vstan-lambda_ipw | [0.00001, L/8, L/4, L/2, L, L*2]* | categorical
VSTAN | hypertuning | vstan-lambda_idf | [1, 2, 5, 10] | categorical
VSTAN | hypertuning | vstan-remind | [True, False] | categorical

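
To give a sense of scale, the sketch below counts how many distinct configurations the V-SkNN search space above contains and compares that with the 100 trials run per experiment group. It is simple counting over the ranges and steps listed in the table; the helper `grid_size` is ours.

```python
from math import prod


def grid_size(low, high, step):
    """Number of values in {low, low + step, ..., high}."""
    return (high - low) // step + 1


# Number of candidate values per V-SkNN hyperparameter, from the table above.
vsknn_space = {
    "vsknn-k": grid_size(50, 1500, 50),
    "vsknn-sample_size": grid_size(500, 10000, 500),
    "vsknn-weighting": 5,        # same, div, linear, quadratic, log
    "vsknn-weighting_score": 5,  # same, div, linear, quadratic, log
    "vsknn-idf_weighting": 4,    # 1, 2, 5, 10
    "vsknn-remind": 2,
    "vsknn-push_reminders": 2,
}

n_configurations = prod(vsknn_space.values())
n_trials = 100

print(f"{n_configurations} configurations, "
      f"{n_trials / n_configurations:.4%} covered by {n_trials} trials")
```

Even for this comparatively small space, 100 trials visit only a tiny fraction of the possible configurations, so the reported best values should be read as the best found within the budget rather than as global optima.
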
Hyperparameters | REES46 eCommerce | YOOCHOOSE eCommerce | G1 news | ADRESSA news |
---|---|---|---|---|
gru4rec-batch_size | 192 | 128 | 512 | 320 |
gru4rec-learning_rate | 0.02987583 | 0.04835963206 | 0.003390859922 | 0.006776399704 |
gru4rec-dropout_p_hidden | 0.2 | 0.3 | 0.4 | 0.1 |
gru4rec-layers | 384 | 320 | 448 | 448 |
gru4rec-embedding | 384 | 256 | 320 | 256 |
gru4rec-constrained_embedding | True | True | True | True |
gru4rec-momentum | 0.0063542217809 | 0.0240110233654 | 0.0033795343757 | 0.0227154672843 |
gru4rec-final_act | linear | linear | tanh | tanh |
gru4rec-loss | bpr-max | top1-max | bpr-max | top1-max |
Hyperparameters | REES46 eCommerce | YOOCHOOSE eCommerce | G1 news | ADRESSA news |
---|---|---|---|---|
gru4rec-batch_size | 256 | 128 | 192 | 512 |
gru4rec-learning_rate | 0.09985796371 | 0.02551529973 | 0.003728175 | 0.006604778881 |
gru4rec-dropout_p_hidden | 0.0 | 0.2 | 0.3 | 0.1 |
gru4rec-layers | 320 | 384 | 384 | 448 |
gru4rec-embedding | 256 | 320 | 64 | 320 |
gru4rec-constrained_embedding | True | True | True | True |
gru4rec-momentum | 0.0080778576522 | 0.0141954218043 | 0.0235705315583 | 0.0131644109509 |
gru4rec-final_act | linear | linear | tanh | tanh |
gru4rec-loss | top1-max | bpr-max | top1-max | top1-max |
training_time_window_size | 6 | 36 | 72 | 72 |
Hyperparameters | REES46 eCommerce | YOOCHOOSE eCommerce | G1 news | ADRESSA news |
---|---|---|---|---|
vsknn-k | 600 | 500 | 800 | 1200 |
vsknn-sample_size | 2500 | 100 | 500 | 500 |
vsknn-weighting | same | quadratic | quadratic | quadratic |
vsknn-weighting_score | linear | quadratic | quadratic | quadratic |
vsknn-idf_weighting | 10 | 10 | False | False |
vsknn-remind | True | False | False | False |
vsknn-push_reminders | True | False | True | False |
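
When reusing these values, it can help to check each one against the search space listed earlier. The sketch below does this for three V-SkNN hyperparameters, with the values copied from the table above and the ranges from the baselines search space; the helper names are ours. It only reports mismatches between the two tables and does not attempt to explain them.

```python
def in_step_grid(value, low, high, step):
    """True if value is an integer on the grid {low, low + step, ..., high}."""
    if isinstance(value, bool) or not isinstance(value, int):
        return False
    return low <= value <= high and (value - low) % step == 0


def in_choices(value, choices):
    """True if value is one of the listed categorical choices (booleans excluded)."""
    return not isinstance(value, bool) and value in choices


# Search space as listed in the baselines table of this appendix.
checks = {
    "vsknn-k": lambda v: in_step_grid(v, 50, 1500, 50),
    "vsknn-sample_size": lambda v: in_step_grid(v, 500, 10000, 500),
    "vsknn-idf_weighting": lambda v: in_choices(v, (1, 2, 5, 10)),
}

# Best values copied from the V-SkNN table above.
best = {
    "REES46": {"vsknn-k": 600, "vsknn-sample_size": 2500, "vsknn-idf_weighting": 10},
    "YOOCHOOSE": {"vsknn-k": 500, "vsknn-sample_size": 100, "vsknn-idf_weighting": 10},
    "G1": {"vsknn-k": 800, "vsknn-sample_size": 500, "vsknn-idf_weighting": False},
    "ADRESSA": {"vsknn-k": 1200, "vsknn-sample_size": 500, "vsknn-idf_weighting": False},
}

outside = [
    (dataset, name, value)
    for dataset, params in best.items()
    for name, value in params.items()
    if not checks[name](value)
]
for dataset, name, value in outside:
    print(f"{dataset}: {name}={value} is not in the listed search space")
```
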
Hyperparameters | REES46 eCommerce | YOOCHOOSE eCommerce | G1 news | ADRESSA news |
---|---|---|---|---|
stan-k | 500 | 950 | 500 | 1850 |
stan-sample_size | 10000 | 8000 | 500 | 500 |
stan-lambda_spw | 5.49 | 0.00001 | 0.6725 | 0.355 |
stan-lambda_snh | 100 | 5 | 100 | 5 |
stan-lambda_inh | 1.3725 | 1.915 | 0.6725 | 0.71 |
stan-remind | True | False | False | False |
Hyperparameters | REES46 eCommerce | YOOCHOOSE eCommerce | G1 news | ADRESSA news |
---|---|---|---|---|
vstan-k | 1300 | 450 | 1250 | 1300 |
vstan-sample_size | 8500 | 4500 | 500 | 1000 |
vstan-lambda_spw | 5.49 | 0.9575 | 2.69 | 0.355 |
vstan-lambda_snh | 80 | 5 | 80 | 100 |
vstan-lambda_inh | 2.745 | 3.83 | 1.345 | 0.355 |
vstan-lambda_ipw | 5.49 | 0.47875 | 0.33625 | 2.84 |
vstan-lambda_idf | 5 | 1 | False | False |
vstan-remind | True | False | False | False |
Hyperparameters | REES46 eCommerce | YOOCHOOSE eCommerce | G1 news | ADRESSA news |
---|---|---|---|---|
stochastic_shared_embeddings_replacement_prob | 0.1 | 0.0 | 0.08 | 0.04 |
d_model | 128 | 192 | 128 | 320 |
item_embedding_dim | 384 | 448 | 448 | 384 |
n_layer | 1 | 1 | 1 | 1 |
input_dropout | 0.2 | 0.2 | 0.4 | 0.1 |
dropout | 0.0 | 0.3 | 0.1 | 0.3 |
learning_rate | 0.0007107976723 | 0.0003469143861 | 0.0006494976636 | 0.0003253950755 |
weight_decay | 4.01E-06 | 2.21E-06 | 6.17E-05 | 7.84E-05 |
per_device_train_batch_size | 448 | 384 | 192 | 192 |
label_smoothing | 0.3 | 0.5 | 0.7 | 0.9 |
item_id_embeddings_init_std | 0.09 | 0.15 | 0.11 | 0.11 |
Hyperparameters | REES46 eCommerce | YOOCHOOSE eCommerce | G1 news | ADRESSA news |
---|---|---|---|---|
stochastic_shared_embeddings_replacement_prob | 0.0 | 0.08 | 0.06 | 0.08 |
d_model | 128 | 192 | 256 | 64 |
item_embedding_dim | 448 | 448 | 448 | 448 |
n_layer | 1 | 2 | 1 | 1 |
n_head | 1 | 1 | 1 | 2 |
input_dropout | 0.4 | 0.3 | 0.0 | 0.1 |
dropout | 0.2 | 0.1 | 0.3 | 0.4 |
learning_rate | 0.0008781937894 | 0.0002622314826 | 0.0004451168156 | 0.000838438163 |
weight_decay | 1.49E-05 | 2.92E-06 | 5.64E-05 | 2.09E-05 |
per_device_train_batch_size | 384 | 320 | 320 | 192 |
label_smoothing | 0.9 | 0.2 | 0.2 | 0.3 |
item_id_embeddings_init_std | 0.03 | 0.05 | 0.11 | 0.07 |
Hyperparameters | REES46 eCommerce | YOOCHOOSE eCommerce | G1 news | ADRESSA news |
---|---|---|---|---|
stochastic_shared_embeddings_replacement_prob | 0.02 | 0.06 | 0.08 | 0.06 |
d_model | 448 | 256 | 128 | 320 |
item_embedding_dim | 320 | 320 | 448 | 448 |
n_layer | 1 | 1 | 1 | 2 |
n_head | 1 | 1 | 8 | 1 |
input_dropout | 0.3 | 0.0 | 0.2 | 0.4 |
dropout | 0.1 | 0.1 | 0 | 0 |
learning_rate | 0.001007765821 | 0.0005964244796 | 0.0003290060713 | 0.0001117800884 |
weight_decay | 1.07E-06 | 3.96E-06 | 1.73E-06 | 2.45E-05 |
per_device_train_batch_size | 512 | 512 | 192 | 128 |
label_smoothing | 0.2 | 0.8 | 0.3 | 0.1 |
item_id_embeddings_init_std | 0.15 | 0.09 | 0.03 | 0.15 |
Hyperparameters | REES46 eCommerce | YOOCHOOSE eCommerce | G1 news | ADRESSA news |
---|---|---|---|---|
attn_type | uni | uni | uni | uni |
stochastic_shared_embeddings_replacement_prob | 0.08 | 0.06 | 0.1 | 0.0 |
d_model | 320 | 448 | 128 | 384 |
item_embedding_dim | 448 | 384 | 448 | 448 |
n_layer | 1 | 2 | 1 | 1 |
n_head | 1 | 1 | 1 | 1 |
input_dropout | 0.0 | 0.1 | 0.1 | 0.4 |
dropout | 0.3 | 0.3 | 0.3 | 0.1 |
learning_rate | 0.002029182148 | 0.00117833948 | 0.002321720478 | 0.0002668717028 |
weight_decay | 1.52E-05 | 4.13E-06 | 8.18E-05 | 5.78E-06 |
per_device_train_batch_size | 192 | 384 | 320 | 192 |
label_smoothing | 0.1 | 0.6 | 0.3 | 0.3 |
item_id_embeddings_init_std | 0.13 | 0.09 | 0.13 | 0.13 |
Hyperparameters | REES46 eCommerce | YOOCHOOSE eCommerce | G1 news | ADRESSA news |
---|---|---|---|---|
attn_type | bi | bi | bi | bi |
stochastic_shared_embeddings_replacement_prob | 0.1 | 0 | 0.08 | 0 |
d_model | 192 | 320 | 384 | 384 |
item_embedding_dim | 448 | 448 | 384 | 384 |
n_layer | 3 | 2 | 4 | 3 |
n_head | 16 | 8 | 8 | 1 |
input_dropout | 0.1 | 0.3 | 0 | 0 |
dropout | 0 | 0 | 0 | 0.5 |
learning_rate | 0.0006667377133 | 0.0005427417425 | 0.0001426544717 | 0.000189558907 |
weight_decay | 3.91E-05 | 5.86E-06 | 8.09E-05 | 1.31E-05 |
per_device_train_batch_size | 192 | 384 | 128 | 192 |
label_smoothing | 0.0 | 0.6 | 0.3 | 0.2 |
item_id_embeddings_init_std | 0.11 | 0.09 | 0.15 | 0.15 |
mlm_probability | 0.3 | 0.3 | 0.3 | 0.2 |
Hyperparameters | REES46 eCommerce | YOOCHOOSE eCommerce | G1 news | ADRESSA news |
---|---|---|---|---|
attn_type | bi | bi | bi | bi |
stochastic_shared_embeddings_replacement_prob | 0.02 | 0 | 0 | 0 |
d_model | 384 | 320 | 256 | 256 |
item_embedding_dim | 384 | 448 | 448 | 448 |
n_layer | 4 | 1 | 1 | 1 |
n_head | 16 | 2 | 1 | 1 |
input_dropout | 0.2 | 0.1 | 0.2 | 0.3 |
dropout | 0 | 0 | 0.1 | 0.1 |
learning_rate | 0.0003387925502 | 0.0001934212295 | 0.0002623729053 | 2.32E-04 |
weight_decay | 2.18E-05 | 7.79E-06 | 1.33E-06 | 9.32E-05 |
per_device_train_batch_size | 320 | 384 | 192 | 192 |
label_smoothing | 0.7 | 0.5 | 0.8 | 0.2 |
item_id_embeddings_init_std | 0.13 | 0.11 | 0.07 | 0.11 |
plm_max_span_length | 3 | 4 | 4 | 2 |
plm_probability | 0.5 | 0.7 | 0.5 | 0.4 |
Hyperparameters | REES46 eCommerce | YOOCHOOSE eCommerce | G1 news | ADRESSA news |
---|---|---|---|---|
attn_type | bi | bi | bi | bi |
d_model | 384 | 384 | 448 | 448 |
item_embedding_dim | 448 | 448 | 384 | 448 |
n_layer | 3 | 4 | 2 | 4 |
n_head | 16 | 4 | 8 | 1 |
input_dropout | 0.2 | 0.3 | 0.2 | 0.4 |
dropout | 0.0 | 0.0 | 0.0 | 0.0 |
learning_rate | 0.0004549311269 | 0.0002805236563 | 0.0001523085576 | 1.76E-04 |
weight_decay | 7.70E-06 | 3.48E-06 | 5.40E-05 | 1.20E-06 |
per_device_train_batch_size | 384 | 320 | 192 | 256 |
label_smoothing | 0.2 | 0.3 | 0.3 | 0.2 |
item_id_embeddings_init_std | 0.15 | 0.11 | 0.15 | 0.09 |
mlm_probability | 0.5 | 0.3 | 0.5 | 0.3 |
rtd_discriminator_loss_weight | 1 | 1 | 1 | 1 |
Hyperparameters | REES46 eCommerce | YOOCHOOSE eCommerce | G1 news | ADRESSA news |
---|---|---|---|---|
stochastic_shared_embeddings_replacement_prob | 0 | 0 | 0 | 0 |
d_model | 384 | 384 | 320 | 256 |
item_embedding_dim | 448 | 320 | 448 | 320 |
n_layer | 2 | 2 | 4 | 3 |
n_head | 2 | 16 | 2 | 8 |
input_dropout | 0.1 | 0 | 0 | 0.4 |
dropout | 0 | 0 | 0 | 0 |
learning_rate | 0.0005122969429 | 0.0003369550189 | 0.0001436547301 | 1.76E-04 |
weight_decay | 8.20E-06 | 3.20E-06 | 1.88E-05 | 1.20E-06 |
per_device_train_batch_size | 320 | 320 | 128 | 256 |
label_smoothing | 0.5 | 0.8 | 0.5 | 0.3 |
item_id_embeddings_init_std | 0.09 | 0.09 | 0.15 | 0.05 |
rtd_discriminator_loss_weight | 1 | 1 | 1 | 1 |
mlm_probability | 0.4 | 0.2 | 0.3 | 0.3 |
Hyperparameters | REES46 eCommerce | YOOCHOOSE eCommerce | G1 news | ADRESSA news |
---|---|---|---|---|
stochastic_shared_embeddings_replacement_prob | 0.06 | 0.02 | 0.06 | 0.08 |
d_model | 320 | 448 | 384 | 192 |
item_embedding_dim | 320 | 448 | 384 | 448 |
n_layer | 2 | 4 | 4 | 4 |
n_head | 8 | 1 | 2 | 8 |
input_dropout | 0.1 | 0.1 | 0.2 | 0.2 |
dropout | 0.0 | 0.0 | 0.0 | 0.0 |
learning_rate | 0.0004904752786 | 0.0002907211377 | 0.0001896108995 | 1.90E-04 |
weight_decay | 9.57E-05 | 1.85E-06 | 1.63E-05 | 2.13E-05 |
per_device_train_batch_size | 192 | 512 | 128 | 192 |
label_smoothing | 0.2 | 0.3 | 0.7 | 0.2 |
item_id_embeddings_init_std | 0.11 | 0.07 | 0.15 | 0.15 |
mlm_probability | 0.6 | 0.3 | 0.2 | 0.4 |
Hyperparameters | REES46 eCommerce | YOOCHOOSE eCommerce | G1 news | ADRESSA news |
---|---|---|---|---|
attn_type | bi | - | bi | bi |
stochastic_shared_embeddings_replacement_prob | 0 | - | 0.02 | 0.02 |
d_model | 448 | - | 384 | 192 |
item_embedding_dim | 448 | - | 448 | 384 |
n_layer | 2 | - | 1 | 2 |
n_head | 8 | - | 8 | 4 |
input_dropout | 0.0 | - | 0 | 0.3 |
dropout | 0 | - | 0 | 0 |
learning_rate | 2.02E-04 | - | 2.70E-04 | 3.43E-04 |
weight_decay | 2.75E-05 | - | 1.54E-05 | 5.88E-06 |
per_device_train_batch_size | 256 | - | 448 | 128 |
label_smoothing | 0.5 | - | 0.5 | 0.4 |
item_id_embeddings_init_std | 0.09 | - | 0.11 | 0.11 |
other_embeddings_init_std | 0.015 | - | 0.01 | 0.03 |
mlm_probability | 0.1 | - | 0.3 | 0.2 |
embedding_dim_from_cardinality_multiplier | 3 | - | 2 | 4 |
Hyperparameters | REES46 eCommerce | YOOCHOOSE eCommerce | G1 news | ADRESSA news |
---|---|---|---|---|
attn_type | bi | - | bi | bi |
stochastic_shared_embeddings_replacement_prob | 0 | - | 0 | 0.06 |
d_model | 448 | - | 128 | 256 |
item_embedding_dim | 384 | - | 384 | 320 |
n_layer | 2 | - | 4 | 1 |
n_head | 8 | - | 16 | 8 |
input_dropout | 0.1 | - | 0.0 | 0.2 |
dropout | 0 | - | 0 | 0 |
learning_rate | 3.40E-04 | - | 2.21E-04 | 4.38E-04 |
weight_decay | 3.17E-05 | - | 6.47E-06 | 1.88E-05 |
per_device_train_batch_size | 256 | - | 128 | 128 |
label_smoothing | 0.6 | - | 0.4 | 0.9 |
item_id_embeddings_init_std | 0.07 | - | 0.13 | 0.13 |
other_embeddings_init_std | 0.085 | - | 0.08 | 0.06 |
mlm_probability | 0.3 | - | 0.2 | 0.4 |
embedding_dim_from_cardinality_multiplier | 1 | - | 1 | 7 |
numeric_features_project_to_embedding_dim | 20 | - | 10 | 20 |
numeric_features_soft_one_hot_encoding_num_embeddings | 5 | - | 20 | 20 |
Hyperparameters | REES46 eCommerce | YOOCHOOSE eCommerce | G1 news | ADRESSA news |
---|---|---|---|---|
attn_type | bi | - | bi | bi |
stochastic_shared_embeddings_replacement_prob | 0 | - | 0 | 0.08 |
d_model | 448 | - | 320 | 384 |
item_embedding_dim | 448 | - | 384 | 448 |
n_layer | 3 | - | 3 | 1 |
n_head | 16 | - | 8 | 8 |
input_dropout | 0.1 | - | 0.2 | 0.1 |
dropout | 0 | - | 0 | 0 |
learning_rate | 3.81E-04 | - | 2.90E-04 | 2.01E-04 |
weight_decay | 3.16E-05 | - | 4.76E-06 | 2.15E-06 |
per_device_train_batch_size | 384 | - | 192 | 192 |
label_smoothing | 0.1 | - | 0.5 | 0.2 |
item_id_embeddings_init_std | 0.13 | - | 0.15 | 0.07 |
other_embeddings_init_std | 0.06 | - | 0.065 | 0.085 |
mlm_probability | 0.5 | - | 0.5 | 0.5 |
embedding_dim_from_cardinality_multiplier | 8 | - | 9 | 7 |