ipinyou data size and base line code #8

Open
Sandy4321 opened this issue Jan 27, 2020 · 5 comments
Comments

@Sandy4321

The iPinYou data in
https://drive.google.com/drive/folders/1thXezQbmuS6Q8-AXmrhB0tLM3mybJxVR?usp=sharing
is 249 MB zipped and 1.5 GB unzipped,
but https://github.com/wnzhang/make-ipinyou-data states:

After the program finished, the total size of the folder will be 14G.

Is the data in the Drive folder so much smaller because it is stored as HDF5, or is some other clarification needed?

I understand that the change described as
removed the user-tag feature considering leaky problems
also reduces the file size somewhat.

Could you please share some baseline code for trying this data?
Then everything would be clear.

@Atomu2014
Owner

HDF5 supports compressed storage, so you should check the number of examples rather than the file size.
I have shared all the baselines compared in my papers; see https://github.com/Atomu2014/product-nets and https://github.com/Atomu2014/product-nets-distributed
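
One way to check the number of examples rather than the file size is to walk the HDF5 file and read each dataset's first dimension. This is only a minimal sketch: it assumes the third-party `h5py` package is installed and that each example is one row of a dataset. The helper names `rows_of` and `count_examples`, and the file name in the usage note, are hypothetical; the actual layout of the shared files may differ.

```python
def rows_of(obj):
    """Return the first dimension of an array-like object, or None.

    Groups (no .shape) and scalar datasets (shape == ()) yield None.
    """
    shape = getattr(obj, "shape", None)
    return shape[0] if shape else None


def count_examples(path):
    """Map every dataset name in the HDF5 file at `path` to its row count."""
    import h5py  # third-party dependency, assumed to be installed

    counts = {}

    def visit(name, obj):
        n = rows_of(obj)
        if n is not None:
            counts[name] = n
        # Returning None tells h5py to keep visiting.

    with h5py.File(path, "r") as f:
        f.visititems(visit)
    return counts
```

For example, `count_examples("train.h5")` (hypothetical file name) would return a dictionary such as `{"some_dataset": <rows>}`, which can be compared with the example counts reported for the original data.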

@Sandy4321
Author

Great, thanks a lot.
But I am looking for a really simple Python baseline,
without complicated packages such as TensorFlow.
Do you have one, or do you know somebody who has?
Performance is not important; I am just trying to learn from the very beginning.

@Atomu2014
Owner

Hi, I suggest you try these packages: xgboost > libfm > libffm
Search for them online and follow the official guides.
These packages are easy to try since you don't need to touch the model; the only thing you need to do is prepare the data and call the API / CLI.
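
The "prepare the data" step mostly means writing each example as a text line in the format the tool reads. Below is a minimal sketch using only the standard library: it hashes categorical values into a fixed number of buckets and writes libsvm-style lines (`label index:value`, the text format accepted by xgboost and libfm) and libffm-style lines (`label field:index:value`). The function names, the bucket count, and the use of CRC32 hashing are illustrative assumptions, not part of any of these packages.

```python
import zlib


def hash_index(field, value, num_buckets=2**20):
    """Map a (field, value) pair to a stable bucket index.

    zlib.crc32 is used because, unlike the built-in hash(), it gives
    the same result in every Python process.
    """
    key = f"{field}={value}".encode("utf-8")
    return zlib.crc32(key) % num_buckets


def to_libsvm(label, cats, num_buckets=2**20):
    """Format one example as 'label index:1 index:1 ...' (indices ascending)."""
    indices = sorted({hash_index(f, v, num_buckets) for f, v in enumerate(cats)})
    return f"{label} " + " ".join(f"{i}:1" for i in indices)


def to_libffm(label, cats, num_buckets=2**20):
    """Format one example as 'label field:index:1 field:index:1 ...'."""
    parts = [f"{f}:{hash_index(f, v, num_buckets)}:1" for f, v in enumerate(cats)]
    return f"{label} " + " ".join(parts)
```

Each categorical value becomes a one-hot feature with value 1. Hash collisions are possible with a small bucket count, and some tools expect 1-based indices, so check the official guide of the package you pick before training.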

@Sandy4321
Author

Great, so where can I get the preprocessed Criteo data set?
Per the description:
The original dataset is known as Criteo 1TB click log, in which the CriteoLab has collected 30 days of masked data. We only know there are 13 numerical and 26 categorical features, and there is no feature description released. Thus we name these features as num_0 ... num_12 and cat_0 ... cat_25.
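
The naming scheme in the quoted passage can be sketched as follows. The column lists come directly from the description (13 numerical, 26 categorical). The parser is an assumption: it supposes a raw log line is tab-separated as label, then the numerical fields, then the categorical fields, with empty strings for missing values. `parse_raw_line` is a hypothetical helper, and the processed files in the download links may be laid out differently.

```python
NUM_COLS = [f"num_{i}" for i in range(13)]  # num_0 ... num_12
CAT_COLS = [f"cat_{i}" for i in range(26)]  # cat_0 ... cat_25


def parse_raw_line(line):
    """Turn one tab-separated raw line into a dict keyed by feature name.

    Assumed layout: label, 13 numerical fields, 26 categorical fields.
    Missing values (empty fields) are returned as None.
    """
    fields = line.rstrip("\n").split("\t")
    expected = 1 + len(NUM_COLS) + len(CAT_COLS)
    if len(fields) != expected:
        raise ValueError(f"expected {expected} fields, got {len(fields)}")

    row = {"label": int(fields[0])}
    num_end = 1 + len(NUM_COLS)
    for name, raw in zip(NUM_COLS, fields[1:num_end]):
        row[name] = int(raw) if raw else None
    for name, raw in zip(CAT_COLS, fields[num_end:]):
        row[name] = raw or None
    return row
```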

@Atomu2014
Owner

Hi, there are two download links in the "Download" section of the README.
The processed dataset contains only 8 days of logs.
