Data/Model storage #1453
@menshikh-iv sklearn stores smaller datasets and models in a separate folder, and also provides fetchers for datasets that are large or require preprocessing; these datasets can be downloaded by importing the corresponding dataset namespace. NLTK provides a downloader which can be imported to fetch all the datasets it makes available. For storing our datasets, we can keep them in the repo if they aren't large, or write a downloader script which does the job.
@souravsingh thanks for the info, let's wait for a detailed comparison from @chaitaliSaini
NLTK: provides a downloader with several interfaces (an interactive installer and installation via the command line) which can be used to download corpora, models, and other data packages usable with NLTK (https://github.com/nltk/nltk/blob/develop/nltk/downloader.py).

sklearn: comes with a few small standard datasets that do not require downloading any file from an external website. Other datasets are stored on mldata.org, and the sklearn.datasets package can download them directly from that repository:

    from sklearn.datasets import fetch_mldata
    # custom_data_home is a directory path where the downloaded data is cached
    mnist = fetch_mldata('MNIST original', data_home=custom_data_home)

spacy: allows models to be downloaded and loaded manually, or using spaCy's download and link commands.

For storage: mldata.org is a public repository for datasets. It is free of charge, and a dataset's file size is limited to 1 GB.
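For comparison, a minimal sketch of the download entry points mentioned above; the NLTK package name and the spaCy model name are only examples, not part of the comparison itself:

```python
# NLTK: programmatic downloader (the 'punkt' package is just an example)
import nltk
nltk.download("punkt")

# spaCy: models are fetched via the command line, e.g. (run in a shell):
#   python -m spacy download en
```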
In my opinion, we should use a "hybrid" sklearn + spacy approach.
For example, a user wants to download the English Wikipedia and store it on the local machine's disk (a sketch of what this could look like follows below). For the "console" way, we would use the same methods, called through the submodule.
What do you think @gojomo @piskvorky @chaitaliSaini?
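A minimal sketch of such a hybrid interface; the module, function and dataset names here are assumptions for illustration (they follow the gensim.downloader module that was eventually merged, see the log further down), not the code from the original comment:

```python
# Hypothetical sketch of the hybrid interface; the dataset name
# "wiki-english" is illustrative, not a real catalogue entry.
import gensim.downloader as api

# "Python" way: download if not cached, then return the local path
path = api.load("wiki-english", return_path=True)
print(path)

# "Console" way: the same methods, called through the submodule, e.g.
#   python -m gensim.downloader --download wiki-english
```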
@menshikh-iv We can talk to Rackspace about their cloud hosting service. Many open-source projects (MacPython, scikit-learn and manylinux) use Rackspace hosting.
@souravsingh we used it too (as temporary storage for wheels); we need to investigate this question.
I investigated the spaCy approach to data storage, and it is awesome! Look at the spacy-models repo: they attach models to GitHub releases. It is unlimited in cumulative file size, number of downloads, etc., and free; the only limitation is a per-file size of < 2 GB.
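A release asset attached this way can be fetched with a plain HTTP download. A minimal sketch, assuming a hypothetical repository, tag and file name (not a guaranteed location):

```python
# Sketch: downloading a model/dataset attached to a GitHub release as an asset.
# The repository, tag and file names are hypothetical examples.
import urllib.request

url = ("https://github.com/example-org/example-data/"
       "releases/download/text8/text8.gz")
urllib.request.urlretrieve(url, "text8.gz")
```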
Another option for big, high-traffic datasets, where gensim would want to be insulated from the potential costs of download popularity, is AWS S3 "requester-pays" buckets. arXiv uses them; see:
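For reference, downloading from a requester-pays bucket with boto3 looks roughly like this; the bucket and key names are hypothetical, and the caller's AWS account is billed for the transfer:

```python
# Sketch: fetching an object from an S3 "requester-pays" bucket.
# Bucket and key names are made up for illustration.
import boto3

s3 = boto3.client("s3")
s3.download_file(
    Bucket="example-gensim-data",
    Key="corpora/wiki-english.xml.bz2",
    Filename="wiki-english.xml.bz2",
    ExtraArgs={"RequestPayer": "requester"},  # acknowledge that the requester pays
)
```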
New plan proposal

Need to implement 2 functions (a usage sketch follows below):
Naming convention:
For datasets: if
Algorithm
Additional requirements:
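A rough sketch of how the two proposed functions could be used, assuming the info()/load() naming that the eventually merged downloader module uses (see the merge log further down); the dataset name is only an example:

```python
# Sketch of the two proposed entry points; names follow the downloader
# module that was merged later, and "text8" is just an example entry.
import gensim.downloader as api

api.info()                                   # catalogue of available datasets/models
path = api.load("text8", return_path=True)   # download if needed, return the local path
corpus = api.load("text8")                   # or load the resource directly
```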
Some examples would be helpful. Otherwise the functionality looks great in general: I especially like the idea with the "related papers / preprocessing code". What does the "alternative names" feature do?
Earlier I thought this was a very useful feature (because users know datasets by different names), but now, if we add more descriptions (related papers, etc.), it is not needed. So, I think we can drop it.
Agreed, let me show an example: the behaviour of load depends on whether the resource is a dataset or a model.
It all sounds good to me. In addition, I'd suggest handling the combination of {resource is data + ...}.
@piskvorky there is no single-valued answer here, so I would not like to do that. I don't see a good, simple solution for this problem.
@menshikh-iv I don't understand the English here. What does it mean? Same with the second point. My suggestion was to return a ready corpus (iterable) for data.
I mean that we have no universal solution, because a dataset can be split across several files, and different files need different handling after opening.
OK, but even if a dataset is split across several files, we still need some code/class to access and use that dataset, right? So let's return that from load. Same for the second point: whatever the user typically has to do with such an opened file, we can do for him automatically. And if he cannot do anything with it, then why even include the file?
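A sketch of the distinction being discussed, using the load() signature of the downloader module that was merged later; whether a ready corpus or a raw path comes back is controlled by a flag, and the dataset name is illustrative:

```python
# Sketch: returning a ready-to-use corpus vs. a raw file path.
# Names follow the merged downloader module; "text8" is an example entry.
import gensim.downloader as api

corpus = api.load("text8")                   # iterable over documents, ready to use
for doc in corpus:
    pass                                     # e.g. feed each document into a model

path = api.load("text8", return_path=True)   # or just the path to the downloaded file
```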
@macks22 @akutuzov @gojomo if you want to add any model/dataset, feel free to contribute to https://github.com/RaRe-Technologies/gensim-data
* added download and catalogue functions
* added link and info
* modeified link and info functions
* Updated download function
* Added logging
* Added load function
* Removed unused imports
* added check for installed models
* updated download function
* Improved help for terminal
* load returns model path
* added jupyter notebook and merged code
* alternate names for load
* corrected formatting
* added checksum after download
* refactored code
* removed log file code
* added progressbar
* fixed pep8
* added tests
* added download for >2gb data
* add test for multipart
* fixed pep8
* remove tar.gz, use only .gz for all
* fix codestyle/docstrings[1]
* add module docstring
* add downloader to apiref
* Fix CLI + more documentation
* documentation for load
* renaming
* fix tests
* fix tests[2]
* add test for info
* reduce logging
* Add return_path=True example to docstring
* fix
* update & rename notebook
* Fix docstring + use ValueError when name is incorrect
* move list to global var
We want to store trained models and popular datasets (in raw/preprocessed format). Also, we want to develop a simple API for accessing this data.
This project makes our users a bit happier.
Plan: