PERFORMANCE: Make AIP runnable through Dask or another platform to parallelize the parsing #5

Open
lfdversluis opened this issue Jun 24, 2020 · 3 comments
Labels
BSc project enhancement New feature or request

Comments

@lfdversluis
Collaborator

Each sub-set of data and each data source can be processed in parallel. Dask can be used to parallelize this.
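As a rough illustration of source-level parallelism, the sketch below submits one task per data source and collects the results by name. It uses the standard library's `concurrent.futures` as a stand-in for Dask so that it has no third-party dependency. The parser functions and their return values are hypothetical placeholders, not AIP's actual parsers.

```python
from concurrent.futures import Executor
from typing import Callable, Dict, List


# Hypothetical per-source parsers (assumption): the real AIP parsers
# are not shown in this issue, so these just return fixed records.
def parse_xml_source() -> List[dict]:
    return [{"source": "xml", "title": "Paper A"}]


def parse_json_source() -> List[dict]:
    return [{"source": "json", "title": "Paper B"}]


def run_sources(
    parsers: Dict[str, Callable[[], List[dict]]],
    executor: Executor,
) -> Dict[str, List[dict]]:
    """Run every parser as an independent task and collect results by source name."""
    # Submit all tasks first so they can run concurrently ...
    futures = {name: executor.submit(fn) for name, fn in parsers.items()}
    # ... then wait for each result.
    return {name: future.result() for name, future in futures.items()}
```

Passing a `ProcessPoolExecutor` would probably suit CPU-bound parsing better than a `ThreadPoolExecutor`, since Python threads share one interpreter lock. The same submit-then-collect shape should carry over to Dask's `dask.delayed` and `dask.compute`, though that has not been tried here.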

@lfdversluis lfdversluis added enhancement New feature or request BSc project labels Jun 24, 2020
@lfdversluis
Collaborator Author

joblib (https://joblib.readthedocs.io/en/latest/) seems promising.

@lfdversluis lfdversluis changed the title Make AIP runnable through Dask or another platform to parallize the parsing PERFORMANCE: Make AIP runnable through Dask or another platform to parallize the parsing Sep 6, 2020
@lfdversluis
Collaborator Author

It might be interesting to investigate whether the XML file and the JSON files of Semantic Scholar / AMiner can be parallelized at the item level. With joblib (linked above), file-level parallelization becomes possible, but the JSON files are structured so that each line is one standalone JSON object. Parsing these lines in parallel may be even faster.
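A minimal sketch of the item-level idea for line-delimited JSON, assuming each non-blank line is a complete JSON object as described above. Lines are grouped into batches so that the cost of handing work to a worker is spread over many records. The function names and the default `chunk_size` are illustrative assumptions, and `concurrent.futures` is used here rather than joblib or Dask.

```python
import json
from concurrent.futures import Executor
from itertools import islice
from typing import Iterable, Iterator, List


def parse_chunk(lines: List[str]) -> List[dict]:
    """Parse one batch of JSON Lines records; blank lines are skipped."""
    return [json.loads(line) for line in lines if line.strip()]


def chunked(lines: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` lines."""
    iterator = iter(lines)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def parse_json_lines(
    lines: Iterable[str],
    executor: Executor,
    chunk_size: int = 10_000,  # assumed default, worth tuning per dataset
) -> List[dict]:
    """Parse batches of lines on the executor's workers, preserving input order."""
    records: List[dict] = []
    # Executor.map returns results in the order the batches were submitted.
    for parsed in executor.map(parse_chunk, chunked(lines, chunk_size)):
        records.extend(parsed)
    return records
```

With a `ProcessPoolExecutor`, an open file object can be passed as `lines`. Note that `Executor.map` submits all batches up front, so the whole input is read into memory. Whether this beats file-level parallelism depends on how the parsing cost compares with the cost of shipping records between processes, which would need measuring.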

@lfdversluis
Collaborator Author

Setting up some benchmarks + regression tests might be a nice idea as well.
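One possible shape for such a check, as a sketch only: a serial parser serves as the reference, and any candidate (for example a parallel variant) has to return the same records before its timing is reported. All names are hypothetical, and a single `time.perf_counter` measurement is only a rough benchmark.

```python
import json
import time
from typing import Any, Callable, List, Tuple


def parse_serial(lines: List[str]) -> List[dict]:
    """Baseline: parse each JSON line one after another."""
    return [json.loads(line) for line in lines]


def timed(fn: Callable[..., Any], *args: Any) -> Tuple[Any, float]:
    """Call fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def check_against_baseline(
    candidate: Callable[[List[str]], List[dict]],
    lines: List[str],
) -> Tuple[float, float]:
    """Regression check plus timing.

    Raises AssertionError if the candidate's output differs from the
    serial baseline; otherwise returns (baseline_seconds, candidate_seconds).
    """
    expected, baseline_seconds = timed(parse_serial, lines)
    actual, candidate_seconds = timed(candidate, lines)
    if actual != expected:
        raise AssertionError("candidate output differs from the serial baseline")
    return baseline_seconds, candidate_seconds
```

Running this over a fixed sample file in CI would catch both wrong output and large slowdowns, provided the timings are averaged over several runs.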
