Data/Model storage #1453
@menshikh-iv sklearn stores smaller datasets and models in a separate folder, and also provides fetchers for datasets that are large or require preprocessing; these datasets can be downloaded by importing the corresponding dataset namespace. NLTK provides a downloader which can be imported to fetch all the datasets it makes available. For storing our datasets, we can keep them in the repo if they aren't large, or write a downloader script which does the job.
@souravsingh thanks for the info, let's wait for a detailed comparison from @chaitaliSaini
NLTK: provides a downloader with several interfaces (an interactive installer and installation via the command line) which can be used to download corpora, models, and other data packages usable with NLTK (https://github.com/nltk/nltk/blob/develop/nltk/downloader.py).

sklearn: comes with a few small standard datasets that do not require downloading any file from an external website. Other datasets are stored on mldata.org, and the sklearn.datasets package can download them directly from that repository:

    from sklearn.datasets import fetch_mldata
    # custom_data_home is a directory path where the downloaded data is cached
    mnist = fetch_mldata('MNIST original', data_home=custom_data_home)

spacy: allows models to be downloaded and loaded manually, or using spaCy's download and link commands.

For storage: mldata.org is a public repository for datasets. It is free of charge, and a dataset's file size is limited to 1 GB.
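For comparison, a minimal sketch of the download entry points mentioned above; the NLTK package name and the spaCy model name are only examples, not part of the comparison itself:

```python
# NLTK: programmatic downloader (the 'punkt' package is just an example)
import nltk
nltk.download("punkt")

# spaCy: models are fetched via the command line, e.g. (run in a shell):
#   python -m spacy download en
```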
In my opinion, we should use a "hybrid" sklearn + spacy approach.
For example, a user wants to download the English Wikipedia and store it on the local machine's disk (a sketch of what this could look like follows below). For the "console" way, we would use the same methods, called through the submodule.
What do you think @gojomo @piskvorky @chaitaliSaini?
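A minimal sketch of such a hybrid interface; the module, function and dataset names here are assumptions for illustration (they follow the gensim.downloader module that was eventually merged, see the log further down), not the code from the original comment:

```python
# Hypothetical sketch of the hybrid interface; the dataset name
# "wiki-english" is illustrative, not a real catalogue entry.
import gensim.downloader as api

# "Python" way: download if not cached, then return the local path
path = api.load("wiki-english", return_path=True)
print(path)

# "Console" way: the same methods, called through the submodule, e.g.
#   python -m gensim.downloader --download wiki-english
```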
@menshikh-iv We can talk to Rackspace about their cloud hosting service. Many open-source projects (MacPython, scikit-learn and manylinux) use Rackspace hosting.
@souravsingh we used it too (as temporary storage for wheels); we need to investigate this question.
I investigated the spaCy approach to data storage, and it is awesome! Look at the spacy-models repo: they attach models to GitHub releases. It is unlimited in cumulative file size, number of downloads, etc., and free; the only limitation is a per-file size of < 2 GB.
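A release asset attached this way can be fetched with a plain HTTP download. A minimal sketch, assuming a hypothetical repository, tag and file name (not a guaranteed location):

```python
# Sketch: downloading a model/dataset attached to a GitHub release as an asset.
# The repository, tag and file names are hypothetical examples.
import urllib.request

url = ("https://github.com/example-org/example-data/"
       "releases/download/text8/text8.gz")
urllib.request.urlretrieve(url, "text8.gz")
```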
Another option for big, high-traffic datasets, where gensim would want to be insulated from the potential costs of download popularity, is AWS S3 "requester-pays" buckets. arXiv uses them; see:
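For reference, downloading from a requester-pays bucket with boto3 looks roughly like this; the bucket and key names are hypothetical, and the caller's AWS account is billed for the transfer:

```python
# Sketch: fetching an object from an S3 "requester-pays" bucket.
# Bucket and key names are made up for illustration.
import boto3

s3 = boto3.client("s3")
s3.download_file(
    Bucket="example-gensim-data",
    Key="corpora/wiki-english.xml.bz2",
    Filename="wiki-english.xml.bz2",
    ExtraArgs={"RequestPayer": "requester"},  # acknowledge that the requester pays
)
```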
New plan proposal

Need to implement 2 functions (a usage sketch follows below):
Naming convention:
For datasets: if
Algorithm
Additional requirements:
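A rough sketch of how the two proposed functions could be used, assuming the info()/load() naming that the eventually merged downloader module uses (see the merge log further down); the dataset name is only an example:

```python
# Sketch of the two proposed entry points; names follow the downloader
# module that was merged later, and "text8" is just an example entry.
import gensim.downloader as api

api.info()                                   # catalogue of available datasets/models
path = api.load("text8", return_path=True)   # download if needed, return the local path
corpus = api.load("text8")                   # or load the resource directly
```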
Some examples would be helpful. Otherwise the functionality looks great in general: I especially like the idea with the "related papers / preprocessing code". What does the "alternative names" feature do?
Earlier I thought this was a very useful feature (because users know datasets by different names), but now, if we add more descriptions (related papers, etc.), it is not needed. So, I think we can drop it.
Agreed, let me show an example: the behaviour of load depends on whether the resource is a dataset or a model.
It all sounds good to me. In addition, I'd suggest handling the combination of {resource is data + ...}.
@piskvorky there is no single-valued answer here, so I would not like to do that. I don't see a good, simple solution for this problem.
@menshikh-iv I don't understand the English here. What does it mean? Same with the second point. My suggestion was to return a ready corpus (iterable) for data.
I mean that we have no universal solution, because a dataset can be split across several files, and different files need different handling after opening.
OK, but even if a dataset is split across several files, we still need some code/class to access and use that dataset, right? So let's return that from load. Same for the second point: whatever the user typically has to do with such an opened file, we can do for him automatically. And if he cannot do anything with it, then why even include the file?
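A sketch of the distinction being discussed, using the load() signature of the downloader module that was merged later; whether a ready corpus or a raw path comes back is controlled by a flag, and the dataset name is illustrative:

```python
# Sketch: returning a ready-to-use corpus vs. a raw file path.
# Names follow the merged downloader module; "text8" is an example entry.
import gensim.downloader as api

corpus = api.load("text8")                   # iterable over documents, ready to use
for doc in corpus:
    pass                                     # e.g. feed each document into a model

path = api.load("text8", return_path=True)   # or just the path to the downloaded file
```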
@macks22 @akutuzov @gojomo if you want to add any model/dataset, feel free to contribute to https://github.com/RaRe-Technologies/gensim-data
* added download and catalogue functions
* added link and info
* modeified link and info functions
* Updated download function
* Added logging
* Added load function
* Removed unused imports
* added check for installed models
* updated download function
* Improved help for terminal
* load returns model path
* added jupyter notebook and merged code
* alternate names for load
* corrected formatting
* added checksum after download
* refactored code
* removed log file code
* added progressbar
* fixed pep8
* added tests
* added download for >2gb data
* add test for multipart
* fixed pep8
* remove tar.gz, use only .gz for all
* fix codestyle/docstrings[1]
* add module docstring
* add downloader to apiref
* Fix CLI + more documentation
* documentation for load
* renaming
* fix tests
* fix tests[2]
* add test for info
* reduce logging
* Add return_path=True example to docstring
* fix
* update & rename notebook
* Fix docstring + use ValueError when name is incorrect
* move list to global var
We want to store trained models and popular datasets (in raw/preprocessed format). Also, we want to develop a simple API for accessing this data.
This project makes our users a bit happier.
Plan: