Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance issue #33

Open
michaelaye opened this issue Aug 10, 2016 · 13 comments
Open

Performance issue #33

michaelaye opened this issue Aug 10, 2016 · 13 comments
Labels
help wanted state:needs follow up This issue has interesting suggestions / ideas worth following up on type:enhancement
Milestone

Comments

@michaelaye
Copy link
Contributor

Everytime i go into a repo with lots of notebooks, it takes several seconds before i get my prompt back, which can be annoying...
I'm wondering if the official PreProcessor of the notebook tools is faster than your manual filtering?

@jankatins
Copy link

I'm not sure how the prompt (of your shell?) can be affected by nbstripoutput, but if you want fast, try jq: http://janschulz.github.io/windows-dev-environment.html -> "proper diffs and commits for notebooks" (sorry, my homepage seems to not put in anchors in headlines :-( )

@michaelaye
Copy link
Contributor Author

oh, that's because my prompt (zsh->garrett) includes git status information, and the status is defined by git getting active, using the nbstripout filter to define the status for each notebook.

@jankatins
Copy link

Huih, I've currently 38 ipynb in a folder with a jq based stripoutput filter (*.ipynb filter=stripoutput) and it's fast in cmder (which also tells me the current git status). in my case git status is also fast (instant?). cmder uses git status --porcelain 2>nul to check if the dir is dirty...

@michaelaye
Copy link
Contributor Author

are u aware that git apparently is able to cache the status somehow? It's only slow for me for the first time entering the folder after I-dont-know-what-interval-or-action. After that it's seemless as well. 66 notebooks here though. ;)

@jankatins
Copy link

Thats sounds more like a filesystem cache (the git status manpage doesn't say anything about a cache for status information)? Do you have a ssd or hdd?

@michaelaye
Copy link
Contributor Author

ssd.

@kynan
Copy link
Owner

kynan commented Aug 18, 2016

@michaelaye Do you have core.preloadindex enabled? That might be it:

Enable parallel index preload for operations like git diff

This can speed up operations like git diff and git status especially on filesystems like NFS that have weak caching semantics and thus
relatively high IO latencies. With this set to true, Git will do the index comparison to the filesystem data in parallel, allowing overlapping
IO’s.

@JanSchulz Thanks for the pointer to jq. Not sure I'd want to introduce such a heavy dependency, but will have a play with it.

@stas00
Copy link

stas00 commented Aug 17, 2018

FYI, I uploaded two versions of much faster nbstripout-like tools. They focus on being fast git strip out filters, without much else.

  1. nbstripout-fast - a pure python json version (it also supports rapidjson, but I found it to be slower)
  2. nbstripout-jq - jq version with a thin /bin/sh wrapper

As of this writing nbstripout-fast and nbstripout-jq are respectively 13 and 9 times faster than nbstripout with the notebooks I tested them with.

They aren't exactly the same as nbstripout, so you may need to apply a few minor tweaks if you need the exact behavior.

These were written for fastai - we ended up using the pure python version as it was the fastest.

Enjoy.

@kynan
Copy link
Owner

kynan commented Aug 19, 2018

Thanks, @stas00! Feel free to send a PR for the rapidjson variant. I'd also be open to add the nbstripout-jq as a separate script.

@stas00
Copy link

stas00 commented Aug 19, 2018

@kynan, the python rewrite is very different internally - so PR won't be possible. It doesn't use any of the nbconvert/nbformat stuff. it's faster than rapidjson or jq. On the other hand it only does the stripping out (no installation tools, validation, etc.).

I first tried to get nbstripout to use rapidjson or ujson, but neither has an identical API to json. So that didn't work.

Given that most likely the strip out has to be as fast as possible for using git with many notebooks, i'd say the original nbstripout needs to be split into 2 pieces - one optimized for speed of the critical path of doing the strip out and another for installation/configuration/validation/etc. If you do that then you can just use nbstripout-fast as is - with a few tweaks to support keep_outputs and and a few other flags. And leave the rest of the features in the other script.

Also, nbstripout-fast is already written in the whitelist format as you suggested here, except the fields are hardcoded and will probably need to be configurable to make it generic enough for anybody to use without changes to the source code.

And yes, feel free to incorporate nbstripout-jq in your repo. it too will need some tweaks to be identical to nbstripout. We dropped the jq version since pure python json rewrite nbstripout-fast is faster.

p.s. if you decide to integrate/adopt the code the -d option keeps outputs and also preserves a few other metadata fields that we use for documentation notebooks, which differs from code notebooks. Obviously this won't be useful for a general user.

@kynan kynan added the state:needs follow up This issue has interesting suggestions / ideas worth following up on label Nov 10, 2019
@kynan kynan added this to the Backlog milestone Jun 28, 2020
@mlucool
Copy link

mlucool commented Dec 8, 2022

Hi,

This is a really excellent project, but like above I found it had too much overhead on larger repos. I wrote a pure rust version (with python bindings so it can be pip installed) located at https://github.com/deshaw/nbstripout-fast (happed to chose the same name as @stas00). My testing shows this is ~200x faster. Really intersted to hear your thoughts @kynan. This is not a true a 1:1 match, but all key features should be included and a few more added.

Is there a way in which this project could let users choose to use the rust version? Your install/setups scripts are great and clearly this project is very very popular. I think some sort of linkage there would be a net postive for the community.

@kynan
Copy link
Owner

kynan commented Dec 10, 2022

This is a true a 1:1 match, but all key features should be included and a few more added.

Do you mean this is not a true 1:1 match?

Is there a way in which this project could let users choose to use the rust version?

In theory yes, however doesn't it feel a little odd to have one tool install another?

Can you give more detail on what exactly you're thinking of? How do you see nbstripouts installation work when choosing nbstripout-fast ? I also think this discussion should be moved to a separate issue.

@mlucool
Copy link

mlucool commented Dec 10, 2022

Do you mean this is not a true 1:1 match?

Yes - sorry!

I also think this discussion should be moved to a separate issue.

Opened #179

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted state:needs follow up This issue has interesting suggestions / ideas worth following up on type:enhancement
Projects
None yet
Development

No branches or pull requests

5 participants