Appendix C - Hypertuning - Search space and best hyperparameters

In this appendix we provide:

  • the average runtime, in minutes, of the 100 hypertuning trials by algorithm and dataset
  • the detailed search space used for hyperparameter tuning and the best hyperparameters found for each experiment group (defined by algorithm, training approach, and dataset).

Table 1. Average runtime (in minutes) of the 100 hypertuning trials by algorithm and dataset

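| Group | Algorithm | REES46 eCommerce Avg. | REES46 eCommerce Std. Dev. | YOOCHOOSE eCommerce Avg. | YOOCHOOSE eCommerce Std. Dev. | G1 news Avg. | G1 news Std. Dev. | ADRESSA news Avg. | ADRESSA news Std. Dev. |
|---|---|---|---|---|---|---|---|---|---|
| | Number of sliding windows | 15 days | | 90 days | | 190 hours (~8 days) | | 190 hours (~8 days) | |
| Baselines | V-SkNN | 191.4 | 15.5 | 316.3 | 107.5 | 282.6 | 88.2 | 163.0 | 95.2 |
| | STAN | 221.3 | 30.0 | 378.4 | 55.6 | 101.2 | 12.9 | 92.6 | 13.2 |
| | VSTAN | 293.3 | 59.0 | 412.2 | 55.9 | 128.7 | 20.1 | 105.8 | 17.1 |
| | GRU4Rec (FT) | 163.0 | 17.5 | 756.3 | 258.3 | 173.1 | 31.4 | 145.1 | 31.6 |
| | GRU4Rec (SWT) | 146.2 | 15.8 | 497.0 | 157.1 | 138.6 | 30.8 | 101.2 | 12.5 |
| | GRU | 148.3 | 25.7 | 122.3 | 35.8 | 63.5 | 10.4 | 51.8 | 8.5 |
| Transformers with only the item id feature | GPT-2 (CLM) | 133.7 | 20.7 | 94.4 | 26.6 | 47.5 | 9.2 | 54.8 | 9.8 |
| | Transformer-XL (CLM) | 108.9 | 28.0 | 125.1 | 37.9 | 56.7 | 11.7 | 69.4 | 15.5 |
| | ALBERT (MLM) | 116.8 | 36.0 | 116.1 | 33.8 | 67.1 | 16.9 | 59.2 | 17.6 |
| | Electra (RTD) | 109.9 | 28.3 | 125.1 | 46.5 | 88.2 | 18.5 | 62.1 | 19.2 |
| | XLNet (PLM) | 430.6 | 50.0 | 756.3 | 91.8 | 197.3 | 35.5 | 205.1 | 36.8 |
| | XLNet (CLM) | 137.9 | 24.0 | 139.8 | 57.8 | 54.4 | 11.9 | 62.1 | 11.2 |
| | XLNet (RTD) | 188.4 | 52.3 | 257.7 | 106.2 | 74.6 | 21.6 | 69.7 | 17.3 |
| | XLNet (MLM) | 104.8 | 40.5 | 120.8 | 42.2 | 63.5 | 17.2 | 63.3 | 16.2 |
| Transformers with side information features | Concatenation merge | 142.7 | 42.8 | - | - | 66.9 | 17.7 | 70.1 | 20.7 |
| | Concatenation merge with numericals using Soft-One Hot Encoding | 173.9 | 46.5 | - | - | 69.0 | 21.4 | 80.8 | 19.8 |
| | Element-wise merge | 127.7 | 26.2 | - | - | 66.8 | 15.4 | 56.3 | 16.4 |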

Notes:

  • Each hypertuning trial runs the full incremental training and evaluation pipeline over a number of sliding windows for each dataset, as given in the first row of Table 1.
  • All experiments were performed on a machine instance with 8 CPU cores, 50 GB of RAM, and one V100 GPU with 32 GB of memory.
  • The training implementation of V-SkNN, STAN and VSTAN baselines is CPU-based; all other algorithms were trained on GPU. The evaluation of the Session k-NN methods and GRU4Rec was performed using CPU multi-processing, and all other algorithms were evaluated using GPU.

Hypertuning Search Space

Table 2. Algorithms using the Transformers4Rec Meta-Architecture - Transformers and GRU baseline - using only the item id feature

| Experiment Group | Type | Hyperparameter Name | Search space | Sampling Distribution |
|---|---|---|---|---|
| Common parameters | fixed | inp_merge | mlp | - |
| | | input_features_aggregation | concat | - |
| | | loss_type | cross_entropy | - |
| | | model_type | gpt2, transfoxl, xlnet, albert, electra, gru (baseline) | - |
| | | mf_constrained_embeddings | True | - |
| | | per_device_eval_batch_size | 512 | - |
| | | tf_out_activation | tanh | - |
| | | similarity_type | concat_mlp | - |
| | | dataloader_drop_last | False | - |
| | | dataloader_drop_last (for large ecommerce) | True | - |
| | | compute_metrics_each_n_steps | 1 | - |
| | | eval_on_last_item_seq_only | True | - |
| | | learning_rate_schedule | linear_with_warmup | - |
| | | learning_rate_warmup_steps | 0 | - |
| | | layer_norm_all_features | False | - |
| | | layer_norm_featurewise | True | - |
| | | num_train_epochs | 10 | - |
| | | session_seq_length_max | 20 | - |
| | hypertuning | d_model | [64, 448] | int_uniform (step 64) |
| | | item_embedding_dim | [64, 448] | int_uniform (step 64) |
| | | n_layer | [1, 4] | int_uniform |
| | | n_head | [1, 2, 4, 8, 16] | categorical |
| | | input_dropout | [0, 0.5] | discrete_uniform (step 0.1) |
| | | dropout | [0, 0.5] | discrete_uniform (step 0.1) |
| | | learning_rate | [0.0001, 0.01] | log_uniform |
| | | weight_decay | [0.000001, 0.001] | log_uniform |
| | | per_device_train_batch_size | [128, 512] | int_uniform (step 64) |
| | | label_smoothing | [0, 0.9] | discrete_uniform (step 0.1) |
| | | item_id_embeddings_init_std | [0.01, 0.15] | discrete_uniform (step 0.02) |
| GRU | hypertuning | stochastic_shared_embeddings_replacement_prob | [0.0, 0.1] | discrete_uniform (step 0.02) |
| GPT2 | hypertuning | stochastic_shared_embeddings_replacement_prob | [0.0, 0.1] | discrete_uniform (step 0.02) |
| TransformerXL | hypertuning | stochastic_shared_embeddings_replacement_prob | [0.0, 0.1] | discrete_uniform (step 0.02) |
| XLNet-CausalLM | fixed | attn_type | uni | - |
| | hypertuning | stochastic_shared_embeddings_replacement_prob | [0.0, 0.1] | discrete_uniform (step 0.02) |
| XLNet-MLM | fixed | mlm | True | - |
| | | attn_type | bi | - |
| | hypertuning | stochastic_shared_embeddings_replacement_prob | [0.0, 0.1] | discrete_uniform (step 0.02) |
| | | mlm_probability | [0, 0.7] | discrete_uniform (step 0.1) |
| XLNet-PLM | fixed | plm | True | - |
| | | attn_type | bi | - |
| | | plm_mask_input | False | - |
| | hypertuning | plm_probability (for ecommerce datasets) | [0, 0.7] | discrete_uniform (step 0.1) |
| | | plm_max_span_length (for ecommerce datasets) | [2, 6] | int_uniform |
| | | plm_probability (for news datasets) | [0.4, 0.8] | discrete_uniform (step 0.1) |
| | | plm_max_span_length (for news datasets) | [1, 4] | int_uniform |
| Electra-RTD | fixed | rtd | True | - |
| | | mlm | True | - |
| | | rtd_tied_generator | True | - |
| | | rtd_use_batch_interaction | False | - |
| | | rtd_sample_from_batch | True | - |
| | hypertuning | rtd_discriminator_loss_weight | [1, 10, 20, 30, 40, 50] | categorical |
| | | mlm_probability | [0, 0.7] | discrete_uniform (step 0.1) |
| ALBERT\* | fixed | mlm | True | - |
| | | inner_group_num | 1 | - |
| | | num_hidden_groups | -1 | - |
| | hypertuning | stochastic_shared_embeddings_replacement_prob | [0.0, 0.1] | discrete_uniform (step 0.02) |
| | | mlm_probability | [0, 0.7] | discrete_uniform (step 0.1) |
*In our experiments, we fixed the parameters “inner_group_num” and “num_hidden_groups” to 1 and -1, respectively. Under this configuration, the layers do not share weights, which makes the model equivalent to the BERT architecture.
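
The sampling distributions named in Table 2 are not defined in this appendix. As a minimal sketch of one plausible reading, the Python below draws a single trial configuration from part of the common search space using only the standard library. The helper names (`int_uniform`, `discrete_uniform`, `log_uniform`, `categorical`, `sample_trial`) are hypothetical and mirror the table's labels; they are not the API of the tuning framework used in the experiments, and the assumption that interval bounds are inclusive is ours.

```python
import math
import random


def int_uniform(rng, low, high, step=1):
    """Pick an integer from {low, low + step, ...} up to high, equally likely."""
    n_steps = (high - low) // step
    return low + step * rng.randint(0, n_steps)


def discrete_uniform(rng, low, high, step):
    """Pick a float from an evenly spaced grid between low and high (inclusive)."""
    n_steps = round((high - low) / step)
    # Round to hide floating-point noise such as 0.30000000000000004.
    return round(low + step * rng.randint(0, n_steps), 10)


def log_uniform(rng, low, high):
    """Pick a float whose logarithm is uniform on [log(low), log(high)]."""
    value = math.exp(rng.uniform(math.log(low), math.log(high)))
    # Clamp to guard against rounding error at the interval edges.
    return min(max(value, low), high)


def categorical(rng, choices):
    """Pick one of the listed choices with equal probability."""
    return rng.choice(choices)


def sample_trial(rng):
    """Draw one configuration from a subset of the Table 2 common search space."""
    return {
        "d_model": int_uniform(rng, 64, 448, step=64),
        "n_layer": int_uniform(rng, 1, 4),
        "n_head": categorical(rng, [1, 2, 4, 8, 16]),
        "input_dropout": discrete_uniform(rng, 0.0, 0.5, 0.1),
        "learning_rate": log_uniform(rng, 1e-4, 1e-2),
        "weight_decay": log_uniform(rng, 1e-6, 1e-3),
        "per_device_train_batch_size": int_uniform(rng, 128, 512, step=64),
        "label_smoothing": discrete_uniform(rng, 0.0, 0.9, 0.1),
    }


rng = random.Random(0)  # fixed seed so the draw is repeatable
trial = sample_trial(rng)
print(trial)
```

Sampling `learning_rate` and `weight_decay` on a log scale spreads trials evenly across orders of magnitude, which is the usual reason to prefer it for ranges such as [0.0001, 0.01].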

Table 3. XLNet (MLM) - Additional hyperparameters when using side information

| Experiment Group | Type | Hyperparameter Name | Search Space | Sampling Distribution |
|---|---|---|---|---|
| Common hyperparameters | fixed | layer_norm_all_features | False | - |
| | fixed | layer_norm_featurewise | True | - |
| | hypertuning | other_embeddings_init_std | [0.005, 0.10] | discrete_uniform (step 0.005) |
| | hypertuning | embedding_dim_from_cardinality_multiplier | [1.0, 10.0] | discrete_uniform (step 1.0) |
| Concatenation merge - Numerical features as scalars | fixed | input_features_aggregation | concat | - |
| Concatenation merge - Numerical features - Soft One-Hot Encoding | fixed | input_features_aggregation | concat | - |
| | hypertuning | numeric_features_project_to_embedding_dim | [5, 55] | discrete_uniform (step 10) |
| | hypertuning | numeric_features_soft_one_hot_encoding_num_embeddings | [5, 55] | discrete_uniform (step 10) |
| Element-wise merge | fixed | input_features_aggregation | elementwise_sum_multiply_item_embedding | - |

Table 4. Baselines

| Experiment Group | Type | Hyperparameter Name | Search space | Sampling Distribution |
|---|---|---|---|---|
| Common parameters | fixed | model_type | gru4rec, vsknn, stan, vstan | - |
| | | eval_on_last_item_seq_only | True | - |
| | | session_seq_length_max | 20 | - |
| GRU4REC | fixed | gru4rec-n_epochs | 10 | - |
| | | no_incremental_training | True | - |
| | | training_time_window_size (full-train) | 0 | - |
| | | training_time_window_size (sliding 20%) | 20% of the length of the dataset | - |
| | hypertuning | gru4rec-batch_size | [128, 512] | int_uniform (step 64) |
| | | gru4rec-learning_rate | [0.0001, 0.1] | log_uniform |
| | | gru4rec-dropout_p_hidden | [0, 0.5] | discrete_uniform (step 0.1) |
| | | gru4rec-layers | [64, 448] | int_uniform (step 64) |
| | | gru4rec-embedding | [0, 448] | int_uniform (step 64) |
| | | gru4rec-constrained_embedding | [True, False] | categorical |
| | | gru4rec-momentum | [0, 0.5] | float_uniform (step 0.01) |
| | | gru4rec-final_act | [elu-0.5, linear, tanh] | categorical |
| | | gru4rec-loss | [bpr-max, top1-max] | categorical |
| V-SkNN | fixed | eval_baseline_cpu_parallel | True | - |
| | | workers_count | 2 | - |
| | hypertuning | vsknn-k | [50, 1500] | int_uniform (step 50) |
| | | vsknn-sample_size | [500, 10000] | int_uniform (step 500) |
| | | vsknn-weighting | [same, div, linear, quadratic, log] | categorical |
| | | vsknn-weighting_score | [same, div, linear, quadratic, log] | categorical |
| | | vsknn-idf_weighting | [1, 2, 5, 10] | categorical |
| | | vsknn-remind | [True, False] | categorical |
| | | vsknn-push_reminders | [True, False] | categorical |
| STAN | fixed | eval_baseline_cpu_parallel | True | - |
| | | workers_count | 2 | - |
| | hypertuning | stan-k | [50, 2000] | int_uniform (step 50) |
| | | stan-sample_size | [500, 10000] | int_uniform (step 500) |
| | | stan-lambda_spw | [0.00001, L/8, L/4, L/2, L, L\*2]\* | categorical |
| | | stan-lambda_snh | [2.5, 5, 10, 20, 40, 80, 100] | categorical |
| | | stan-lambda_inh | [0.00001, L/8, L/4, L/2, L, L\*2]\* | categorical |
| | | stan-remind | [True, False] | categorical |
| VSTAN | fixed | eval_baseline_cpu_parallel | True | - |
| | | workers_count | 2 | - |
| | hypertuning | vstan-k | [50, 2000] | int_uniform (step 50) |
| | | vstan-sample_size | [500, 10000] | int_uniform (step 500) |
| | | vstan-lambda_spw | [0.00001, L/8, L/4, L/2, L, L\*2]\* | categorical |
| | | vstan-lambda_snh | [2.5, 5, 10, 20, 40, 80, 100] | categorical |
| | | vstan-lambda_inh | [0.00001, L/8, L/4, L/2, L, L\*2]\* | categorical |
| | | vstan-lambda_ipw | [0.00001, L/8, L/4, L/2, L, L\*2]\* | categorical |
| | | vstan-lambda_idf | [1, 2, 5, 10] | categorical |
| | | vstan-remind | [True, False] | categorical |

\* Where L is the average session length.

Best Hyperparameters per Algorithm

Baselines

GRU4REC-FT

| Hyperparameters | REES46 eCommerce | YOOCHOOSE eCommerce | G1 news | ADRESSA news |
|---|---|---|---|---|
| gru4rec-batch_size | 192 | 128 | 512 | 320 |
| gru4rec-learning_rate | 0.02987583 | 0.04835963206 | 0.003390859922 | 0.006776399704 |
| gru4rec-dropout_p_hidden | 0.2 | 0.3 | 0.4 | 0.1 |
| gru4rec-layers | 384 | 320 | 448 | 448 |
| gru4rec-embedding | 384 | 256 | 320 | 256 |
| gru4rec-constrained_embedding | True | True | True | True |
| gru4rec-momentum | 0.0063542217809 | 0.0240110233654 | 0.0033795343757 | 0.0227154672843 |
| gru4rec-final_act | linear | linear | tanh | tanh |
| gru4rec-loss | bpr-max | top1-max | bpr-max | top1-max |

GRU4REC-SWT

| Hyperparameters | REES46 eCommerce | YOOCHOOSE eCommerce | G1 news | ADRESSA news |
|---|---|---|---|---|
| gru4rec-batch_size | 256 | 128 | 192 | 512 |
| gru4rec-learning_rate | 0.09985796371 | 0.02551529973 | 0.003728175 | 0.006604778881 |
| gru4rec-dropout_p_hidden | 0.0 | 0.2 | 0.3 | 0.1 |
| gru4rec-layers | 320 | 384 | 384 | 448 |
| gru4rec-embedding | 256 | 320 | 64 | 320 |
| gru4rec-constrained_embedding | True | True | True | True |
| gru4rec-momentum | 0.0080778576522 | 0.0141954218043 | 0.0235705315583 | 0.0131644109509 |
| gru4rec-final_act | linear | linear | tanh | tanh |
| gru4rec-loss | top1-max | bpr-max | top1-max | top1-max |
| training_time_window_size | 6 | 36 | 72 | 72 |

V-SkNN

| Hyperparameters | REES46 eCommerce | YOOCHOOSE eCommerce | G1 news | ADRESSA news |
|---|---|---|---|---|
| vsknn-k | 600 | 500 | 800 | 1200 |
| vsknn-sample_size | 2500 | 100 | 500 | 500 |
| vsknn-weighting | same | quadratic | quadratic | quadratic |
| vsknn-weighting_score | linear | quadratic | quadratic | quadratic |
| vsknn-idf_weighting | 10 | 10 | False | False |
| vsknn-remind | True | False | False | False |
| vsknn-push_reminders | True | False | True | False |

STAN

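| Hyperparameters | REES46 eCommerce | YOOCHOOSE eCommerce | G1 news | ADRESSA news |
|---|---|---|---|---|
| stan-k | 500 | 950 | 500 | 1850 |
| stan-sample_size | 10000 | 8000 | 500 | 500 |
| stan-lambda_spw | 5.49 | 1.00E-05 | 0.6725 | 0.355 |
| stan-lambda_snh | 100 | 5 | 100 | 5 |
| stan-lambda_inh | 1.3725 | 1.915 | 0.6725 | 0.71 |
| stan-remind | True | False | False | False |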

VSTAN

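| Hyperparameters | REES46 eCommerce | YOOCHOOSE eCommerce | G1 news | ADRESSA news |
|---|---|---|---|---|
| vstan-k | 1300 | 450 | 1250 | 1300 |
| vstan-sample_size | 8500 | 4500 | 500 | 1000 |
| vstan-lambda_spw | 5.49 | 9.575E-01 | 2.69 | 0.355 |
| vstan-lambda_snh | 80 | 5 | 80 | 100 |
| vstan-lambda_inh | 2.745 | 3.83 | 1.345 | 0.355 |
| vstan-lambda_ipw | 5.49 | 0.47875 | 0.33625 | 2.84 |
| vstan-lambda_idf | 5 | 1 | False | False |
| vstan-remind | True | False | False | False |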

Transformers with only item id feature

GRU

| Hyperparameters | REES46 eCommerce | YOOCHOOSE eCommerce | G1 news | ADRESSA news |
|---|---|---|---|---|
| stochastic_shared_embeddings_replacement_prob | 0.1 | 0.0 | 0.08 | 0.04 |
| d_model | 128 | 192 | 128 | 320 |
| item_embedding_dim | 384 | 448 | 448 | 384 |
| n_layer | 1 | 1 | 1 | 1 |
| input_dropout | 0.2 | 0.2 | 0.4 | 0.1 |
| dropout | 0.0 | 0.3 | 0.1 | 0.3 |
| learning_rate | 0.0007107976723 | 0.0003469143861 | 0.0006494976636 | 0.0003253950755 |
| weight_decay | 4.01E-06 | 2.21E-06 | 6.17E-05 | 7.84E-05 |
| per_device_train_batch_size | 448 | 384 | 192 | 192 |
| label_smoothing | 0.3 | 0.5 | 0.7 | 0.9 |
| item_id_embeddings_init_std | 0.09 | 0.15 | 0.11 | 0.11 |

GPT2

| Hyperparameters | REES46 eCommerce | YOOCHOOSE eCommerce | G1 news | ADRESSA news |
|---|---|---|---|---|
| stochastic_shared_embeddings_replacement_prob | 0.0 | 0.08 | 0.06 | 0.08 |
| d_model | 128 | 192 | 256 | 64 |
| item_embedding_dim | 448 | 448 | 448 | 448 |
| n_layer | 1 | 2 | 1 | 1 |
| n_head | 1 | 1 | 1 | 2 |
| input_dropout | 0.4 | 0.3 | 0.0 | 0.1 |
| dropout | 0.2 | 0.1 | 0.3 | 0.4 |
| learning_rate | 0.0008781937894 | 0.0002622314826 | 0.0004451168156 | 0.000838438163 |
| weight_decay | 1.49E-05 | 2.92E-06 | 5.64E-05 | 2.09E-05 |
| per_device_train_batch_size | 384 | 320 | 320 | 192 |
| label_smoothing | 0.9 | 0.2 | 0.2 | 0.3 |
| item_id_embeddings_init_std | 0.03 | 0.05 | 0.11 | 0.07 |

TransformerXL

| Hyperparameters | REES46 eCommerce | YOOCHOOSE eCommerce | G1 news | ADRESSA news |
|---|---|---|---|---|
| stochastic_shared_embeddings_replacement_prob | 0.02 | 0.06 | 0.08 | 0.06 |
| d_model | 448 | 256 | 128 | 320 |
| item_embedding_dim | 320 | 320 | 448 | 448 |
| n_layer | 1 | 1 | 1 | 2 |
| n_head | 1 | 1 | 8 | 1 |
| input_dropout | 0.3 | 0.0 | 0.2 | 0.4 |
| dropout | 0.1 | 0.1 | 0 | 0 |
| learning_rate | 0.001007765821 | 0.0005964244796 | 0.0003290060713 | 0.0001117800884 |
| weight_decay | 1.07E-06 | 3.96E-06 | 1.73E-06 | 2.45E-05 |
| per_device_train_batch_size | 512 | 512 | 192 | 128 |
| label_smoothing | 0.2 | 0.8 | 0.3 | 0.1 |
| item_id_embeddings_init_std | 0.15 | 0.09 | 0.03 | 0.15 |

XLNet-CausalLM

| Hyperparameters | REES46 eCommerce | YOOCHOOSE eCommerce | G1 news | ADRESSA news |
|---|---|---|---|---|
| attn_type | uni | uni | uni | uni |
| stochastic_shared_embeddings_replacement_prob | 0.08 | 0.06 | 0.1 | 0.0 |
| d_model | 320 | 448 | 128 | 384 |
| item_embedding_dim | 448 | 384 | 448 | 448 |
| n_layer | 1 | 2 | 1 | 1 |
| n_head | 1 | 1 | 1 | 1 |
| input_dropout | 0.0 | 0.1 | 0.1 | 0.4 |
| dropout | 0.3 | 0.3 | 0.3 | 0.1 |
| learning_rate | 0.002029182148 | 0.00117833948 | 0.002321720478 | 0.0002668717028 |
| weight_decay | 1.52E-05 | 4.13E-06 | 8.18E-05 | 5.78E-06 |
| per_device_train_batch_size | 192 | 384 | 320 | 192 |
| label_smoothing | 0.1 | 0.6 | 0.3 | 0.3 |
| item_id_embeddings_init_std | 0.13 | 0.09 | 0.13 | 0.13 |

XLNet-MLM

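| Hyperparameters | REES46 eCommerce | YOOCHOOSE eCommerce | G1 news | ADRESSA news |
|---|---|---|---|---|
| attn_type | bi | bi | bi | bi |
| stochastic_shared_embeddings_replacement_prob | 0.1 | 0 | 0.08 | 0 |
| d_model | 192 | 320 | 384 | 384 |
| item_embedding_dim | 448 | 448 | 384 | 384 |
| n_layer | 3 | 2 | 4 | 3 |
| n_head | 16 | 8 | 8 | 1 |
| input_dropout | 0.1 | 0.3 | 0 | 0 |
| dropout | 0 | 0 | 0 | 0.5 |
| learning_rate | 0.0006667377133 | 0.0005427417425 | 0.0001426544717 | 0.000189558907 |
| weight_decay | 3.91E-05 | 5.86E-06 | 8.09E-05 | 1.31E-05 |
| per_device_train_batch_size | 192 | 384 | 128 | 192 |
| label_smoothing | 0.0 | 0.6 | 0.3 | 0.2 |
| item_id_embeddings_init_std | 0.11 | 0.09 | 0.15 | 0.15 |
| mlm_probability | 0.3 | 0.3 | 0.3 | 0.2 |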

XLNet-PLM

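| Hyperparameters | REES46 eCommerce | YOOCHOOSE eCommerce | G1 news | ADRESSA news |
|---|---|---|---|---|
| attn_type | bi | bi | bi | bi |
| stochastic_shared_embeddings_replacement_prob | 0.02 | 0 | 0 | 0 |
| d_model | 384 | 320 | 256 | 256 |
| item_embedding_dim | 384 | 448 | 448 | 448 |
| n_layer | 4 | 1 | 1 | 1 |
| n_head | 16 | 2 | 1 | 1 |
| input_dropout | 0.2 | 0.1 | 0.2 | 0.3 |
| dropout | 0 | 0 | 0.1 | 0.1 |
| learning_rate | 0.0003387925502 | 0.0001934212295 | 0.0002623729053 | 2.32E-04 |
| weight_decay | 2.18E-05 | 7.79E-06 | 1.33E-06 | 9.32E-05 |
| per_device_train_batch_size | 320 | 384 | 192 | 192 |
| label_smoothing | 0.7 | 0.5 | 0.8 | 0.2 |
| item_id_embeddings_init_std | 0.13 | 0.11 | 0.07 | 0.11 |
| plm_max_span_length | 3 | 4 | 4 | 2 |
| plm_probability | 0.5 | 0.7 | 0.5 | 0.4 |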

XLNet-RTD

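| Hyperparameters | REES46 eCommerce | YOOCHOOSE eCommerce | G1 news | ADRESSA news |
|---|---|---|---|---|
| attn_type | bi | bi | bi | bi |
| d_model | 384 | 384 | 448 | 448 |
| item_embedding_dim | 448 | 448 | 384 | 448 |
| n_layer | 3 | 4 | 2 | 4 |
| n_head | 16 | 4 | 8 | 1 |
| input_dropout | 0.2 | 0.3 | 0.2 | 0.4 |
| dropout | 0.0 | 0.0 | 0.0 | 0.0 |
| learning_rate | 0.0004549311269 | 0.0002805236563 | 0.0001523085576 | 1.76E-04 |
| weight_decay | 7.70E-06 | 3.48E-06 | 5.40E-05 | 1.20E-06 |
| per_device_train_batch_size | 384 | 320 | 192 | 256 |
| label_smoothing | 0.2 | 0.3 | 0.3 | 0.2 |
| item_id_embeddings_init_std | 0.15 | 0.11 | 0.15 | 0.09 |
| mlm_probability | 0.5 | 0.3 | 0.5 | 0.3 |
| rtd_discriminator_loss_weight | 1 | 1 | 1 | 1 |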

ELECTRA

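| Hyperparameters | REES46 eCommerce | YOOCHOOSE eCommerce | G1 news | ADRESSA news |
|---|---|---|---|---|
| stochastic_shared_embeddings_replacement_prob | 0 | 0 | 0 | 0 |
| d_model | 384 | 384 | 320 | 256 |
| item_embedding_dim | 448 | 320 | 448 | 320 |
| n_layer | 2 | 2 | 4 | 3 |
| n_head | 2 | 16 | 2 | 8 |
| input_dropout | 0.1 | 0 | 0 | 0.4 |
| dropout | 0 | 0 | 0 | 0 |
| learning_rate | 0.0005122969429 | 0.0003369550189 | 0.0001436547301 | 1.76E-04 |
| weight_decay | 8.20E-06 | 3.20E-06 | 1.88E-05 | 1.20E-06 |
| per_device_train_batch_size | 320 | 320 | 128 | 256 |
| label_smoothing | 0.5 | 0.8 | 0.5 | 0.3 |
| item_id_embeddings_init_std | 0.09 | 0.09 | 0.15 | 0.05 |
| rtd_discriminator_loss_weight | 1 | 1 | 1 | 1 |
| mlm_probability | 0.4 | 0.2 | 0.3 | 0.3 |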

ALBERT

| Hyperparameters | REES46 eCommerce | YOOCHOOSE eCommerce | G1 news | ADRESSA news |
|---|---|---|---|---|
| stochastic_shared_embeddings_replacement_prob | 0.06 | 0.02 | 0.06 | 0.08 |
| d_model | 320 | 448 | 384 | 192 |
| item_embedding_dim | 320 | 448 | 384 | 448 |
| n_layer | 2 | 4 | 4 | 4 |
| n_head | 8 | 1 | 2 | 8 |
| input_dropout | 0.1 | 0.1 | 0.2 | 0.2 |
| dropout | 0.0 | 0.0 | 0.0 | 0.0 |
| learning_rate | 0.0004904752786 | 0.0002907211377 | 0.0001896108995 | 1.90E-04 |
| weight_decay | 9.57E-05 | 1.85E-06 | 1.63E-05 | 2.13E-05 |
| per_device_train_batch_size | 192 | 512 | 128 | 192 |
| label_smoothing | 0.2 | 0.3 | 0.7 | 0.2 |
| item_id_embeddings_init_std | 0.11 | 0.07 | 0.15 | 0.15 |
| mlm_probability | 0.6 | 0.3 | 0.2 | 0.4 |

XLNet (MLM) with side information features

XLNet-MLM-all-concat

| Hyperparameters | REES46 eCommerce | YOOCHOOSE eCommerce | G1 news | ADRESSA news |
|---|---|---|---|---|
| attn_type | bi | - | bi | bi |
| stochastic_shared_embeddings_replacement_prob | 0 | - | 0.02 | 0.02 |
| d_model | 448 | - | 384 | 192 |
| item_embedding_dim | 448 | - | 448 | 384 |
| n_layer | 2 | - | 1 | 2 |
| n_head | 8 | - | 8 | 4 |
| input_dropout | 0.0 | - | 0 | 0.3 |
| dropout | 0 | - | 0 | 0 |
| learning_rate | 2.02E-04 | - | 2.70E-04 | 3.43E-04 |
| weight_decay | 2.75E-05 | - | 1.54E-05 | 5.88E-06 |
| per_device_train_batch_size | 256 | - | 448 | 128 |
| label_smoothing | 0.5 | - | 0.5 | 0.4 |
| item_id_embeddings_init_std | 0.09 | - | 0.11 | 0.11 |
| other_embeddings_init_std | 0.015 | - | 0.01 | 0.03 |
| mlm_probability | 0.1 | - | 0.3 | 0.2 |
| embedding_dim_from_cardinality_multiplier | 3 | - | 2 | 4 |

XLNet-MLM-all-concat-numeric_soft_embedding

| Hyperparameters | REES46 eCommerce | YOOCHOOSE eCommerce | G1 news | ADRESSA news |
|---|---|---|---|---|
| attn_type | bi | - | bi | bi |
| stochastic_shared_embeddings_replacement_prob | 0 | - | 0 | 0.06 |
| d_model | 448 | - | 128 | 256 |
| item_embedding_dim | 384 | - | 384 | 320 |
| n_layer | 2 | - | 4 | 1 |
| n_head | 8 | - | 16 | 8 |
| input_dropout | 0.1 | - | 0.0 | 0.2 |
| dropout | 0 | - | 0 | 0 |
| learning_rate | 3.40E-04 | - | 2.21E-04 | 4.38E-04 |
| weight_decay | 3.17E-05 | - | 6.47E-06 | 1.88E-05 |
| per_device_train_batch_size | 256 | - | 128 | 128 |
| label_smoothing | 0.6 | - | 0.4 | 0.9 |
| item_id_embeddings_init_std | 0.07 | - | 0.13 | 0.13 |
| other_embeddings_init_std | 0.085 | - | 0.08 | 0.06 |
| mlm_probability | 0.3 | - | 0.2 | 0.4 |
| embedding_dim_from_cardinality_multiplier | 1 | - | 1 | 7 |
| numeric_features_project_to_embedding_dim | 20 | - | 10 | 20 |
| numeric_features_soft_one_hot_encoding_num_embeddings | 5 | - | 20 | 20 |

XLNet-MLM-all-elementwise

| Hyperparameters | REES46 eCommerce | YOOCHOOSE eCommerce | G1 news | ADRESSA news |
|---|---|---|---|---|
| attn_type | bi | - | bi | bi |
| stochastic_shared_embeddings_replacement_prob | 0 | - | 0 | 0.08 |
| d_model | 448 | - | 320 | 384 |
| item_embedding_dim | 448 | - | 384 | 448 |
| n_layer | 3 | - | 3 | 1 |
| n_head | 16 | - | 8 | 8 |
| input_dropout | 0.1 | - | 0.2 | 0.1 |
| dropout | 0 | - | 0 | 0 |
| learning_rate | 3.81E-04 | - | 2.90E-04 | 2.01E-04 |
| weight_decay | 3.16E-05 | - | 4.76E-06 | 2.15E-06 |
| per_device_train_batch_size | 384 | - | 192 | 192 |
| label_smoothing | 0.1 | - | 0.5 | 0.2 |
| item_id_embeddings_init_std | 0.13 | - | 0.15 | 0.07 |
| other_embeddings_init_std | 0.06 | - | 0.065 | 0.085 |
| mlm_probability | 0.5 | - | 0.5 | 0.5 |
| embedding_dim_from_cardinality_multiplier | 8 | - | 9 | 7 |