SRNN Datapreprocessing script #124

Open
wants to merge 3 commits into base: harsha/reorg
Conversation

@pushkalkatara (Contributor) commented Aug 21, 2019

Hi @harsha-simhadri,
this is a quick implementation of the script process_google.py.
I have also checked SRNN; the resulting accuracy is in the .ipynb included in this PR.
Solves issue #122.

@pushkalkatara changed the title from "SRNN Datapreprocessing script #122" to "SRNN Datapreprocessing script" on Aug 21, 2019
@metastableB (Contributor)

@pushkalkatara is there any reason you preferred h5py over numpy.memmap?

@pushkalkatara (Contributor, Author)

@metastableB numpy.memmap does not store the dims and dtypes, so we would have to spell out the train, test, and val dims and dtypes in SRNN_example.py. Also, I have generally seen h5py or pandas used for this purpose. We can shift to numpy.memmap if the extra dependency is an issue.
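For illustration, a minimal sketch of the difference (the file names, shapes, and dtypes below are placeholders, not the actual dataset's):

```python
import numpy as np
import h5py

# Placeholder array standing in for one of the processed splits.
x_train = np.random.rand(100, 99, 32).astype(np.float32)

# h5py stores shape and dtype inside the file, so the reader needs no metadata.
with h5py.File("train.h5", "w") as f:
    f.create_dataset("x_train", data=x_train)
with h5py.File("train.h5", "r") as f:
    x_loaded = f["x_train"][:]

# numpy.memmap requires the reader to already know shape and dtype.
x_train.tofile("train.dat")
x_mm = np.memmap("train.dat", dtype=np.float32, mode="r", shape=(100, 99, 32))
```

With memmap that shape/dtype metadata has to live somewhere in SRNN_example.py; with h5py it travels with the file.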

@metastableB (Contributor)

@pushkalkatara Yes, I am apprehensive about adding an extra dependency just for one script, though I must admit I don't know how complex the code would become with plain numpy. Let's use pandas instead? It's already part of the requirements here.

@harsha-simhadri (Collaborator)

@metastableB are you able to fix this using pandas?

@metastableB (Contributor)

@pushkalkatara do you want me to take over or are you working on this?

@pushkalkatara (Contributor, Author)

I can work on it. We would need to save the pandas DataFrame in some format: CSV, pickle, or HDF5. Which one should I use?

@metastableB (Contributor)

Thanks!

Ah, I did not think this through. CSV will cause file sizes to bloat. It seems pickle is the best route, since numpy.load also supports loading from pickled files.

We might have to change the scripts to reflect the new file names.
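For reference, a minimal sketch of the pickle route (the file and column names are placeholders):

```python
import numpy as np
import pandas as pd

# Placeholder frame standing in for a processed split.
df = pd.DataFrame(np.random.rand(10, 4), columns=["f0", "f1", "f2", "f3"])

# Save as a plain pickle; dtypes and column names travel with the file.
df.to_pickle("train.pkl")

# Load back with pandas ...
df_loaded = pd.read_pickle("train.pkl")

# ... or with numpy.load, which falls back to pickle for non-.npy files
# when allow_pickle=True is passed.
obj = np.load("train.pkl", allow_pickle=True)
```

pd.read_pickle is the most direct reader; np.load with allow_pickle=True should also work on a plain pickle file, per the note above.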

@metastableB (Contributor)

@pushkalkatara Any updates?

@pushkalkatara (Contributor, Author)

@metastableB Yes, I'll make the changes today.

@harsha-simhadri force-pushed the harsha/reorg branch 6 times, most recently from 2120eb9 to 7f90603, on October 20, 2019.