Appendix C - Hypertuning - Search space and best hyperparameters

This appendix provides:

- the average runtime (in minutes) of the 100 hypertuning trials, by algorithm and dataset;
- the detailed search space used for hyperparameter tuning, and the best hyperparameters found for each experiment group (defined by algorithm, training approach and dataset).

Average runtime (in minutes) of the 100 hypertuning trials by algorithm and dataset
Hypertuning Search Space
Best Hyperparameters per Algorithm
- Baselines
  - GRU4REC (FT)
  - GRU4REC (SWT)
  - V-SkNN
  - STAN
  - VSTAN
- Transformers with only item id feature
  - GRU
  - GPT2
  - TransformerXL
  - XLNet-CLM
  - XLNet-MLM
  - XLNet-PLM
  - XLNet-RTD
  - ELECTRA
  - ALBERT
- XLNET MLM with side information features

Table 1. Average runtime (in minutes) of the 100 hypertuning trials by algorithm and dataset

		REES46 eCommerce		YOOCHOOSE eCommerce		G1 news		ADRESSA news
	Number of sliding windows	15 days		90 days		190 hours (~8 days)		190 hours (~8 days)
	Algorithm	Avg.	Std. Dev.	Avg.	Std. Dev.	Avg.	Std. Dev.	Avg.	Std. Dev.
Baselines	V-SkNN	191.4	15.5	316.3	107.5	282.6	88.2	163.0	95.2
	STAN	221.3	30.0	378.4	55.6	101.2	12.9	92.6	13.2
	VSTAN	293.3	59.0	412.2	55.9	128.7	20.1	105.8	17.1
	GRU4Rec (FT)	163.0	17.5	756.3	258.3	173.1	31.4	145.1	31.6
	GRU4Rec (SWT)	146.2	15.8	497.0	157.1	138.6	30.8	101.2	12.5
	GRU	148.3	25.7	122.3	35.8	63.5	10.4	51.8	8.5
Transformers with only the item id feature	GPT-2 (CLM)	133.7	20.7	94.4	26.6	47.5	9.2	54.8	9.8
	Transformer-XL (CLM)	108.9	28.0	125.1	37.9	56.7	11.7	69.4	15.5
	ALBERT (MLM)	116.8	36.0	116.1	33.8	67.1	16.9	59.2	17.6
	Electra (RTD)	109.9	28.3	125.1	46.5	88.2	18.5	62.1	19.2
	XLNet (PLM)	430.6	50.0	756.3	91.8	197.3	35.5	205.1	36.8
	XLNet (CLM)	137.9	24.0	139.8	57.8	54.4	11.9	62.1	11.2
	XLNet (RTD)	188.4	52.3	257.7	106.2	74.6	21.6	69.7	17.3
	XLNet (MLM)	104.8	40.5	120.8	42.2	63.5	17.2	63.3	16.2
Transformers with side information features	Concatenation merge	142.7	42.8	-	-	66.9	17.7	70.1	20.7
	Concatenation merge with numericals using Soft-One Hot Encoding	173.9	46.5	-	-	69.0	21.4	80.8	19.8
	Element-wise merge	127.7	26.2	-	-	66.8	15.4	56.3	16.4

Notes:

Each hypertuning trial runs the full incremental training and evaluation pipeline over a number of sliding windows for each dataset, given in the first row of the table.
All experiments ran on a machine instance with 8 CPU cores, 50 GB of RAM and one V100 GPU with 32 GB of memory.
The training implementation of V-SkNN, STAN and VSTAN baselines is CPU-based; all other algorithms were trained on GPU. The evaluation of the Session k-NN methods and GRU4Rec was performed using CPU multi-processing, and all other algorithms were evaluated using GPU.
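
As a rough worked example of what these averages imply, the sketch below multiplies a per-trial average from Table 1 by the 100 trials to estimate the total tuning time of one experiment group. It assumes the trials ran one after another on a single instance, which the notes above do not state, and the function name is illustrative.

```python
def total_tuning_hours(avg_minutes_per_trial: float, n_trials: int = 100) -> float:
    """Estimate total wall-clock hours, assuming trials run sequentially."""
    return avg_minutes_per_trial * n_trials / 60.0


# Per-trial averages (minutes) copied from the REES46 column of Table 1.
rees46_avg_minutes = {
    "XLNet (MLM)": 104.8,
    "XLNet (PLM)": 430.6,
    "V-SkNN": 191.4,
}

for name, minutes in rees46_avg_minutes.items():
    print(f"{name}: ~{total_tuning_hours(minutes):.1f} hours for 100 trials")
```

If trials were run in parallel on several instances, the wall-clock time would be correspondingly shorter, while the total compute would stay the same.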

Hypertuning Search Space

Table 2. Algorithms using the Transformers4Rec Meta-Architecture - Transformers and GRU baseline - using only the item id feature

Experiment Group	Type	Hyperparameter Name	Search space	Sampling Distribution
Common parameters	fixed	inp_merge	mlp	-
		input_features_aggregation	concat	-
		loss_type	cross_entropy	-
		model_type	gpt2, transfoxl, xlnet, albert, electra, gru (baseline)	-
		mf_constrained_embeddings	True	-
		per_device_eval_batch_size	512	-
		tf_out_activation	tanh	-
		similarity_type	concat_mlp	-
		dataloader_drop_last	False	-
		dataloader_drop_last (for large ecommerce)	True	-
		compute_metrics_each_n_steps	1	-
		eval_on_last_item_seq_only	True	-
		learning_rate_schedule	linear_with_warmup	-
		learning_rate_warmup_steps	0	-
		layer_norm_all_features	False	-
		layer_norm_featurewise	True	-
		num_train_epochs	10	-
		session_seq_length_max	20	-
	hypertuning	d_model	[64,448]	int_uniform (step 64)
		item_embedding_dim	[64,448]	int_uniform (step 64)
		n_layer	[1,4]	int_uniform
		n_head	[1, 2, 4, 8, 16]	categorical
		input_dropout	[0, 0.5]	discrete_uniform (step 0.1)
		dropout	[0, 0.5]	discrete_uniform (step 0.1)
		learning_rate	[0.0001, 0.01]	log_uniform
		weight_decay	[0.000001, 0.001]	log_uniform
		per_device_train_batch_size	[128, 512]	int_uniform (step 64)
		label_smoothing	[0, 0.9]	discrete_uniform (step 0.1)
		item_id_embeddings_init_std	[0.01, 0.15]	discrete_uniform (step 0.02)
GRU	hypertuning	stochastic_shared_embeddings_replacement_prob	[0.0, 0.1]	discrete_uniform (step 0.02)
GPT2	hypertuning	stochastic_shared_embeddings_replacement_prob	[0.0, 0.1]	discrete_uniform (step 0.02)
TransformerXL	hypertuning	stochastic_shared_embeddings_replacement_prob	[0.0, 0.1]	discrete_uniform (step 0.02)
XLNet-CausalLM	fixed	attn_type	uni	-
XLNet-CausalLM	hypertuning	stochastic_shared_embeddings_replacement_prob	[0.0, 0.1]	discrete_uniform (step 0.02)
XLNet-MLM	fixed	mlm	True	-
	fixed	attn_type	bi	-
	hypertuning	stochastic_shared_embeddings_replacement_prob	[0.0, 0.1]	discrete_uniform (step 0.02)
	hypertuning	mlm_probability	[0, 0.7]	discrete_uniform (step 0.1)
XLNet-PLM	fixed	plm	True	-
		attn_type	bi	-
		plm_mask_input	False	-
	hypertuning	plm_probability (for ecommerce dataset)	[0, 0.7]	discrete_uniform (step 0.1)
		plm_max_span_length (for ecommerce datasets)	[2, 6]	int_uniform
		plm_probability (for news datasets)	[0.4, 0.8]	discrete_uniform (step 0.1)
		plm_max_span_length (for news datasets)	[1, 4]	int_uniform
Electra-RTD	fixed	rtd	True	-
		mlm	True	-
		rtd_tied_generator	True	-
		rtd_use_batch_interaction	False	-
		rtd_sample_from_batch	True	-
	hypertuning	rtd_discriminator_loss_weight	[1, 10, 20, 30, 40, 50]	categorical
	hypertuning	mlm_probability	[0, 0.7]	discrete_uniform (step 0.1)
ALBERT*	fixed	mlm	True	-
		inner_group_num	1	-
		num_hidden_groups	-1	-
	hypertuning	stochastic_shared_embeddings_replacement_prob	[0.0, 0.1]	discrete_uniform (step 0.02)
	hypertuning	mlm_probability	[0, 0.7]	discrete_uniform (step 0.1)

*In our experiments, we fixed the parameters “inner_group_num” and “num_hidden_groups” to 1 and -1, respectively. Under this configuration the layers do not share weights, which is equivalent to the BERT architecture.
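
The sampling distributions named in Table 2 can be illustrated with a small standard-library sampler. This is only a sketch of what `int_uniform`, `discrete_uniform`, `log_uniform` and `categorical` mean for a subset of the common hypertuning rows; the appendix does not describe the tuner or search strategy used, and the helper names below are hypothetical.

```python
import math
import random


def int_uniform(rng, low, high, step=1):
    """Sample an integer from {low, low + step, ..., high}."""
    return low + step * rng.randint(0, (high - low) // step)


def discrete_uniform(rng, low, high, step):
    """Sample a float from an evenly spaced grid between low and high."""
    n_steps = round((high - low) / step)
    # Round to hide floating-point noise such as 0.30000000000000004.
    return round(low + step * rng.randint(0, n_steps), 10)


def log_uniform(rng, low, high):
    """Sample a float whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def sample_common_trial(rng):
    """Draw one configuration from a subset of the common hypertuning rows."""
    return {
        "d_model": int_uniform(rng, 64, 448, step=64),
        "item_embedding_dim": int_uniform(rng, 64, 448, step=64),
        "n_layer": int_uniform(rng, 1, 4),
        "n_head": rng.choice([1, 2, 4, 8, 16]),  # categorical
        "input_dropout": discrete_uniform(rng, 0.0, 0.5, 0.1),
        "learning_rate": log_uniform(rng, 1e-4, 1e-2),
        "weight_decay": log_uniform(rng, 1e-6, 1e-3),
        "per_device_train_batch_size": int_uniform(rng, 128, 512, step=64),
        "label_smoothing": discrete_uniform(rng, 0.0, 0.9, 0.1),
        "item_id_embeddings_init_std": discrete_uniform(rng, 0.01, 0.15, 0.02),
    }


rng = random.Random(0)  # fixed seed so the draw is reproducible
print(sample_common_trial(rng))
```

Sampling the learning rate and weight decay on a log scale spreads trials evenly across orders of magnitude, which a plain uniform draw over [0.0001, 0.01] would not do.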

Table 3. XLNet (MLM) - Additional hyperparameters when using side information

Experiment Group	Type	Hyperparameter Name	Search Space	Sampling Distribution
Common hyperparameters	fixed	layer_norm_all_features	False	-
	fixed	layer_norm_featurewise	True	-
	hypertuning	other_embeddings_init_std	[0.005, 0.10]	discrete_uniform (step 0.005)
	hypertuning	embedding_dim_from_cardinality_multiplier	[1.0, 10.0]	discrete_uniform (step 1.0)
Concatenation merge - Numerical features as scalars	fixed	input_features_aggregation	concat	-
Concatenation merge - Numerical features with Soft One-Hot Encoding	fixed	input_features_aggregation	concat	-
	hypertuning	numeric_features_project_to_embedding_dim	[5, 55]	discrete_uniform (step 10)
	hypertuning	numeric_features_soft_one_hot_encoding_num_embeddings	[5, 55]	discrete_uniform (step 10)
Element-wise merge	fixed	input_features_aggregation	elementwise_sum_multiply_item_embedding	-
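
The two Soft One-Hot Encoding hyperparameters in Table 3 can be read as the sizes of a small embedding table kept for each numerical feature. The sketch below assumes the usual formulation of the technique: a scalar is projected to `num_embeddings` logits, passed through a softmax, and the resulting weights are used to combine the rows of an embedding table. The appendix does not spell this out, so the class, its random initialisation and the plain-Python implementation are illustrative assumptions, not the code used in the experiments.

```python
import math
import random


class SoftOneHotEncoder:
    """Toy soft one-hot encoder for a single numerical feature (illustrative)."""

    def __init__(self, num_embeddings, embedding_dim, seed=0):
        rng = random.Random(seed)
        # Projection from the scalar to one logit per embedding row.
        self.weight = [rng.gauss(0.0, 1.0) for _ in range(num_embeddings)]
        self.bias = [0.0] * num_embeddings
        # Embedding table of shape (num_embeddings, embedding_dim).
        self.table = [
            [rng.gauss(0.0, 0.05) for _ in range(embedding_dim)]
            for _ in range(num_embeddings)
        ]

    def soft_one_hot(self, x):
        """Softmax over the projected scalar: a 'soft' one-hot vector."""
        logits = [w * x + b for w, b in zip(self.weight, self.bias)]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(v - peak) for v in logits]
        total = sum(exps)
        return [e / total for e in exps]

    def encode(self, x):
        """Weighted sum of embedding rows, one value per embedding dimension."""
        probs = self.soft_one_hot(x)
        dim = len(self.table[0])
        return [
            sum(p * row[j] for p, row in zip(probs, self.table))
            for j in range(dim)
        ]


# Hypothetical sizes chosen only for the demonstration.
encoder = SoftOneHotEncoder(num_embeddings=5, embedding_dim=20)
vector = encoder.encode(0.37)
print(len(vector))
```

Under this reading, `numeric_features_soft_one_hot_encoding_num_embeddings` sets the number of rows in the table and `numeric_features_project_to_embedding_dim` sets the length of the output vector.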

Table 4. Baselines

Experiment Group	Type	Hyperparameter Name	Search space	Sampling Distribution
Common parameters	fixed	model_type	gru4rec, vsknn, stan, vstan	-
		eval_on_last_item_seq_only	True	-
		session_seq_length_max	20	-
GRU4REC	fixed	gru4rec-n_epochs	10	-
		no_incremental_training	True	-
		training_time_window_size (full-train)	0	-
		training_time_window_size (sliding 20%)	20% of the length of the dataset	-
	hypertuning	gru4rec-batch_size	[128, 512]	int_uniform (step 64)
		gru4rec-learning_rate	[0.0001, 0.1]	log_uniform
		gru4rec-dropout_p_hidden	[0, 0.5]	discrete_uniform (step 0.1)
		gru4rec-layers	[64,448]	int_uniform (step 64)
		gru4rec-embedding	[0,448]	int_uniform (step 64)
		gru4rec-constrained_embedding	[True, False]	categorical
		gru4rec-momentum	[0, 0.5]	float_uniform (step 0.01)
		gru4rec-final_act	[elu-0.5, linear, tanh]	categorical
		gru4rec-loss	[bpr-max, top1-max]	categorical
V-SkNN	fixed	eval_baseline_cpu_parallel	True	-
	fixed	workers_count	2	-
	hypertuning	vsknn-k	[50, 1500]	int_uniform (step 50)
		vsknn-sample_size	[500, 10000]	int_uniform (step 500)
		vsknn-weighting	[same, div, linear, quadratic, log]	categorical
		vsknn-weighting_score	[same, div, linear, quadratic, log]	categorical
		vsknn-idf_weighting	[1, 2, 5 ,10]	categorical
		vsknn-remind	[True, False]	categorical
		vsknn-push_reminders	[True, False]	categorical
STAN	fixed	eval_baseline_cpu_parallel	True	-
	fixed	workers_count	2	-
	hypertuning	stan-k	[50, 2000]	int_uniform (step 50)
		stan-sample_size	[500, 10000]	int_uniform (step 500)
		stan-lambda_spw	[0.00001, L/8, L/4, L/2, L, L2]	categorical
		stan-lambda_snh	[2.5, 5, 10, 20, 40, 80,100]	categorical
		stan-lambda_inh	[0.00001, L/8, L/4, L/2, L, L2]	categorical
		stan-remind	[True, False]	categorical
VSTAN	fixed	eval_baseline_cpu_parallel	True	-
	fixed	workers_count	2	-
	hypertuning	vstan-k	[50, 2000]	int_uniform (step 50)
		vstan-sample_size	[500, 10000]	int_uniform (step 500)
		vstan-lambda_spw	[0.00001, L/8, L/4, L/2, L, L2]	categorical
		vstan-lambda_snh	[2.5, 5, 10, 20, 40, 80,100]	categorical
		vstan-lambda_inh	[0.00001, L/8, L/4, L/2, L, L2]	categorical
		vstan-lambda_ipw	[0.00001, L/8, L/4, L/2, L, L2]	categorical
		vstan-lambda_idf	[1,2,5,10]	categorical
		vstan-remind	[True, False]	categorical

* Where L is the average session length
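
To make the `L`-relative candidates concrete, the sketch below expands the fractional entries of a lambda search space for a given average session length. The value L = 5.49 is an assumption: it is not stated in this appendix and is only inferred from the REES46 best hyperparameters reported further down (for example, 1.3725 = 5.49 / 4). The last entry of each list (`L2`) is left out because its exact meaning cannot be recovered from the table.

```python
import math


def lambda_candidates(avg_session_length):
    """Expand the entries [0.00001, L/8, L/4, L/2, L] for one dataset."""
    L = avg_session_length
    return [0.00001, L / 8, L / 4, L / 2, L]


# Assumed average session length for REES46 (inferred, not stated in the appendix).
candidates = lambda_candidates(5.49)
print(candidates)
```

Tying the candidates to `L` keeps the decay parameters on a comparable scale across datasets with different session lengths.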

Best Hyperparameters per Algorithm

Baselines

GRU4REC-FT

Hyperparameters	REES46 eCommerce	YOOCHOOSE eCommerce	G1 news	ADRESSA news
gru4rec-batch_size	192	128	512	320
gru4rec-learning_rate	0.02987583	0.04835963206	0.003390859922	0.006776399704
gru4rec-dropout_p_hidden	0.2	0.3	0.4	0.1
gru4rec-layers	384	320	448	448
gru4rec-embedding	384	256	320	256
gru4rec-constrained_embedding	True	True	True	True
gru4rec-momentum	0.0063542217809	0.0240110233654	0.0033795343757	0.0227154672843
gru4rec-final_act	linear	linear	tanh	tanh
gru4rec-loss	bpr-max	top1-max	bpr-max	top1-max

GRU4REC-SWT

Hyperparameters	REES46 eCommerce	YOOCHOOSE eCommerce	G1 news	ADRESSA news
gru4rec-batch_size	256	128	192	512
gru4rec-learning_rate	0.09985796371	0.02551529973	0.003728175	0.006604778881
gru4rec-dropout_p_hidden	0.0	0.2	0.3	0.1
gru4rec-layers	320	384	384	448
gru4rec-embedding	256	320	64	320
gru4rec-constrained_embedding	True	True	True	True
gru4rec-momentum	0.0080778576522	0.0141954218043	0.0235705315583	0.0131644109509
gru4rec-final_act	linear	linear	tanh	tanh
gru4rec-loss	top1-max	bpr-max	top1-max	top1-max
training_time_window_size	6	36	72	72

V-SkNN

Hyperparameters	REES46 eCommerce	YOOCHOOSE eCommerce	G1 news	ADRESSA news
vsknn-k	600	500	800	1200
vsknn-sample_size	2500	100	500	500
vsknn-weighting	same	quadratic	quadratic	quadratic
vsknn-weighting_score	linear	quadratic	quadratic	quadratic
vsknn-idf_weighting	10	10	False	False
vsknn-remind	True	False	False	False
vsknn-push_reminders	True	False	True	False

STAN

Hyperparameters	REES46 eCommerce	YOOCHOOSE eCommerce	G1 news	ADRESSA news
stan-k	500	950	500	1850
stan-sample_size	10000	8000	500	500
stan-lambda_spw	5.49	1.00E-05	0.6725	0.355
stan-lambda_snh	100	5	100	5
stan-lambda_inh	1.3725	1.915	0.6725	0.71
stan-remind	True	False	False	False

VSTAN

Hyperparameters	REES46 eCommerce	YOOCHOOSE eCommerce	G1 news	ADRESSA news
vstan-k	1300	450	1250	1300
vstan-sample_size	8500	4500	500	1000
vstan-lambda_spw	5.49	9.575E-01	2.69	0.355
vstan-lambda_snh	80	5	80	100
vstan-lambda_inh	2.745	3.83	1.345	0.355
vstan-lambda_ipw	5.49	0.47875	0.33625	2.84
vstan-lambda_idf	5	1	False	False
vstan-remind	True	False	False	False

Transformers with only item id feature

GRU

Hyperparameters	REES46 eCommerce	YOOCHOOSE eCommerce	G1 news	ADRESSA news
stochastic_shared_embeddings_replacement_prob	0.1	0.0	0.08	0.04
d_model	128	192	128	320
item_embedding_dim	384	448	448	384
n_layer	1	1	1	1
input_dropout	0.2	0.2	0.4	0.1
dropout	0.0	0.3	0.1	0.3
learning_rate	0.0007107976723	0.0003469143861	0.0006494976636	0.0003253950755
weight_decay	4.01E-06	2.21E-06	6.17E-05	7.84E-05
per_device_train_batch_size	448	384	192	192
label_smoothing	0.3	0.5	0.7	0.9
item_id_embeddings_init_std	0.09	0.15	0.11	0.11

GPT2

Hyperparameters	REES46 eCommerce	YOOCHOOSE eCommerce	G1 news	ADRESSA news
stochastic_shared_embeddings_replacement_prob	0.0	0.08	0.06	0.08
d_model	128	192	256	64
item_embedding_dim	448	448	448	448
n_layer	1	2	1	1
n_head	1	1	1	2
input_dropout	0.4	0.3	0.0	0.1
dropout	0.2	0.1	0.3	0.4
learning_rate	0.0008781937894	0.0002622314826	0.0004451168156	0.000838438163
weight_decay	1.49E-05	2.92E-06	5.64E-05	2.09E-05
per_device_train_batch_size	384	320	320	192
label_smoothing	0.9	0.2	0.2	0.3
item_id_embeddings_init_std	0.03	0.05	0.11	0.07

TransformerXL

Hyperparameters	REES46 eCommerce	YOOCHOOSE eCommerce	G1 news	ADRESSA news
stochastic_shared_embeddings_replacement_prob	0.02	0.06	0.08	0.06
d_model	448	256	128	320
item_embedding_dim	320	320	448	448
n_layer	1	1	1	2
n_head	1	1	8	1
input_dropout	0.3	0.0	0.2	0.4
dropout	0.1	0.1	0	0
learning_rate	0.001007765821	0.0005964244796	0.0003290060713	0.0001117800884
weight_decay	1.07E-06	3.96E-06	1.73E-06	2.45E-05
per_device_train_batch_size	512	512	192	128
label_smoothing	0.2	0.8	0.3	0.1
item_id_embeddings_init_std	0.15	0.09	0.03	0.15

XLNet-CausalLM

Hyperparameters	REES46 eCommerce	YOOCHOOSE eCommerce	G1 news	ADRESSA news
attn_type	uni	uni	uni	uni
stochastic_shared_embeddings_replacement_prob	0.08	0.06	0.1	0.0
d_model	320	448	128	384
item_embedding_dim	448	384	448	448
n_layer	1	2	1	1
n_head	1	1	1	1
input_dropout	0.0	0.1	0.1	0.4
dropout	0.3	0.3	0.3	0.1
learning_rate	0.002029182148	0.00117833948	0.002321720478	0.0002668717028
weight_decay	1.52E-05	4.13E-06	8.18E-05	5.78E-06
per_device_train_batch_size	192	384	320	192
label_smoothing	0.1	0.6	0.3	0.3
item_id_embeddings_init_std	0.13	0.09	0.13	0.13

XLNet-MLM

Hyperparameters	REES46 eCommerce	YOOCHOOSE eCommerce	G1 news	ADRESSA news
attn_type	bi	bi	bi	bi
stochastic_shared_embeddings_replacement_prob	0.1	0	0.08	0
d_model	192	320	384	384
item_embedding_dim	448	448	384	384
n_layer	3	2	4	3
n_head	16	8	8	1
input_dropout	0.1	0.3	0	0
dropout	0	0	0	0.5
learning_rate	0.0006667377133	0.0005427417425	0.0001426544717	0.000189558907
weight_decay	3.91E-05	5.86E-06	8.09E-05	1.31E-05
per_device_train_batch_size	192	384	128	192
label_smoothing	0.0	0.6	0.3	0.2
item_id_embeddings_init_std	0.11	0.09	0.15	0.15
mlm_probability	0.3	0.3	0.3	0.2
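
As a consistency check, the sketch below tests whether the REES46 column of the XLNet-MLM table lies within the ranges and grids defined in Table 2. The way the search space is encoded and the helper names are illustrative assumptions; only the bounds, steps and values come from this appendix.

```python
import math


def on_grid(value, low, high, step):
    """True if value lies in [low, high] on a grid of the given step."""
    if not (low - 1e-9 <= value <= high + 1e-9):
        return False
    k = (value - low) / step
    return math.isclose(k, round(k), abs_tol=1e-6)  # tolerate float noise


def in_range(value, low, high):
    """True if value lies in the closed interval [low, high]."""
    return low <= value <= high


# Each check mirrors one search-space row of Table 2.
CHECKS = {
    "d_model": lambda v: on_grid(v, 64, 448, 64),
    "item_embedding_dim": lambda v: on_grid(v, 64, 448, 64),
    "n_layer": lambda v: on_grid(v, 1, 4, 1),
    "n_head": lambda v: v in (1, 2, 4, 8, 16),
    "input_dropout": lambda v: on_grid(v, 0.0, 0.5, 0.1),
    "learning_rate": lambda v: in_range(v, 1e-4, 1e-2),
    "weight_decay": lambda v: in_range(v, 1e-6, 1e-3),
    "per_device_train_batch_size": lambda v: on_grid(v, 128, 512, 64),
    "label_smoothing": lambda v: on_grid(v, 0.0, 0.9, 0.1),
    "item_id_embeddings_init_std": lambda v: on_grid(v, 0.01, 0.15, 0.02),
    "stochastic_shared_embeddings_replacement_prob": lambda v: on_grid(v, 0.0, 0.1, 0.02),
    "mlm_probability": lambda v: on_grid(v, 0.0, 0.7, 0.1),
}

# XLNet-MLM, REES46 eCommerce column.
xlnet_mlm_rees46 = {
    "d_model": 192,
    "item_embedding_dim": 448,
    "n_layer": 3,
    "n_head": 16,
    "input_dropout": 0.1,
    "learning_rate": 0.0006667377133,
    "weight_decay": 3.91e-05,
    "per_device_train_batch_size": 192,
    "label_smoothing": 0.0,
    "item_id_embeddings_init_std": 0.11,
    "stochastic_shared_embeddings_replacement_prob": 0.1,
    "mlm_probability": 0.3,
}

violations = [
    name for name, value in xlnet_mlm_rees46.items() if not CHECKS[name](value)
]
print(violations)
```

The same check can be pointed at any other column by swapping in its values; a non-empty list flags entries that fall outside the documented search space.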

XLNet-PLM

Hyperparameters	REES46 eCommerce	YOOCHOOSE eCommerce	G1 news	ADRESSA news
attn_type	bi	bi	bi	bi
stochastic_shared_embeddings_replacement_prob	0.02	0	0	0
d_model	384	320	256	256
item_embedding_dim	384	448	448	448
n_layer	4	1	1	1
n_head	16	2	1	1
input_dropout	0.2	0.1	0.2	0.3
dropout	0	0	0.1	0.1
learning_rate	0.0003387925502	0.0001934212295	0.0002623729053	2.32E-04
weight_decay	2.18E-05	7.79E-06	1.33E-06	9.32E-05
per_device_train_batch_size	320	384	192	192
label_smoothing	0.7	0.5	0.8	0.2
item_id_embeddings_init_std	0.13	0.11	0.07	0.11
plm_max_span_length	3	4	4	2
plm_probability	0.5	0.7	0.5	0.4

XLNet-RTD

Hyperparameters	REES46 eCommerce	YOOCHOOSE eCommerce	G1 news	ADRESSA news
attn_type	bi	bi	bi	bi
d_model	384	384	448	448
item_embedding_dim	448	448	384	448
n_layer	3	4	2	4
n_head	16	4	8	1
input_dropout	0.2	0.3	0.2	0.4
dropout	0.0	0.0	0.0	0.0
learning_rate	0.0004549311269	0.0002805236563	0.0001523085576	1.76E-04
weight_decay	7.70E-06	3.48E-06	5.40E-05	1.20E-06
per_device_train_batch_size	384	320	192	256
label_smoothing	0.2	0.3	0.3	0.2
item_id_embeddings_init_std	0.15	0.11	0.15	0.09
mlm_probability	0.5	0.3	0.5	0.3
rtd_discriminator_loss_weight	1	1	1	1

ELECTRA

Hyperparameters	REES46 eCommerce	YOOCHOOSE eCommerce	G1 news	ADRESSA news
stochastic_shared_embeddings_replacement_prob	0	0	0	0
d_model	384	384	320	256
item_embedding_dim	448	320	448	320
n_layer	2	2	4	3
n_head	2	16	2	8
input_dropout	0.1	0	0	0.4
dropout	0	0	0	0
learning_rate	0.0005122969429	0.0003369550189	0.0001436547301	1.76E-04
weight_decay	8.20E-06	3.20E-06	1.88E-05	1.20E-06
per_device_train_batch_size	320	320	128	256
label_smoothing	0.5	0.8	0.5	0.3
item_id_embeddings_init_std	0.09	0.09	0.15	0.05
rtd_discriminator_loss_weight	1	1	1	1
mlm_probability	0.4	0.2	0.3	0.3

ALBERT

Hyperparameters	REES46 eCommerce	YOOCHOOSE eCommerce	G1 news	ADRESSA news
stochastic_shared_embeddings_replacement_prob	0.06	0.02	0.06	0.08
d_model	320	448	384	192
item_embedding_dim	320	448	384	448
n_layer	2	4	4	4
n_head	8	1	2	8
input_dropout	0.1	0.1	0.2	0.2
dropout	0.0	0.0	0.0	0.0
learning_rate	0.0004904752786	0.0002907211377	0.0001896108995	1.90E-04
weight_decay	9.57E-05	1.85E-06	1.63E-05	2.13E-05
per_device_train_batch_size	192	512	128	192
label_smoothing	0.2	0.3	0.7	0.2
item_id_embeddings_init_std	0.11	0.07	0.15	0.15
mlm_probability	0.6	0.3	0.2	0.4

XLNET MLM with side information features

XLNet-MLM-all-concat

Hyperparameters	REES46 eCommerce	YOOCHOOSE eCommerce	G1 news	ADRESSA news
attn_type	bi	-	bi	bi
stochastic_shared_embeddings_replacement_prob	0	-	0.02	0.02
d_model	448	-	384	192
item_embedding_dim	448	-	448	384
n_layer	2	-	1	2
n_head	8	-	8	4
input_dropout	0.0	-	0	0.3
dropout	0	-	0	0
learning_rate	2.02E-04	-	2.70E-04	3.43E-04
weight_decay	2.75E-05	-	1.54E-05	5.88E-06
per_device_train_batch_size	256	-	448	128
label_smoothing	0.5	-	0.5	0.4
item_id_embeddings_init_std	0.09	-	0.11	0.11
other_embeddings_init_std	0.015	-	0.01	0.03
mlm_probability	0.1	-	0.3	0.2
embedding_dim_from_cardinality_multiplier	3	-	2	4

XLNet-MLM-all-concat-numeric_soft_embedding

Hyperparameters	REES46 eCommerce	YOOCHOOSE eCommerce	G1 news	ADRESSA news
attn_type	bi	-	bi	bi
stochastic_shared_embeddings_replacement_prob	0	-	0	0.06
d_model	448	-	128	256
item_embedding_dim	384	-	384	320
n_layer	2	-	4	1
n_head	8	-	16	8
input_dropout	0.1	-	0.0	0.2
dropout	0	-	0	0
learning_rate	3.40E-04	-	2.21E-04	4.38E-04
weight_decay	3.17E-05	-	6.47E-06	1.88E-05
per_device_train_batch_size	256	-	128	128
label_smoothing	0.6	-	0.4	0.9
item_id_embeddings_init_std	0.07	-	0.13	0.13
other_embeddings_init_std	0.085	-	0.08	0.06
mlm_probability	0.3	-	0.2	0.4
embedding_dim_from_cardinality_multiplier	1	-	1	7
numeric_features_project_to_embedding_dim	20	-	10	20
numeric_features_soft_one_hot_encoding_num_embeddings	5	-	20	20

XLNet-MLM-all-elementwise

Hyperparameters	REES46 eCommerce	YOOCHOOSE eCommerce	G1 news	ADRESSA news
attn_type	bi	-	bi	bi
stochastic_shared_embeddings_replacement_prob	0	-	0	0.08
d_model	448	-	320	384
item_embedding_dim	448	-	384	448
n_layer	3	-	3	1
n_head	16	-	8	8
input_dropout	0.1	-	0.2	0.1
dropout	0	-	0	0
learning_rate	3.81E-04	-	2.90E-04	2.01E-04
weight_decay	3.16E-05	-	4.76E-06	2.15E-06
per_device_train_batch_size	384	-	192	192
label_smoothing	0.1	-	0.5	0.2
item_id_embeddings_init_std	0.13	-	0.15	0.07
other_embeddings_init_std	0.06	-	0.065	0.085
mlm_probability	0.5	-	0.5	0.5
embedding_dim_from_cardinality_multiplier	8	-	9	7
