Optimize html tag replace regex (Fixes #331) #373
Although it passes the unit tests (except of course the |
Unrelated failing tests fixed here |
@mishig25 I see it fails on the case:
Should be simple to fix. I'll also add this as a test case. |
Mostly edge-cases Note: HTML is quite forgiving and will still parse/render tags even when some tags are mismatched or missing. The previous regex would barf on this.
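To illustrate the point about mismatched or missing tags, here is a small sketch. Both patterns are hypothetical stand-ins written for this example (neither is the regex from doc-builder): `PAIRED_RE` insists on a matching close tag, while `SINGLE_TAG_RE` looks at each tag on its own.

```python
import re

# Hypothetical pair-matching pattern: an opening tag must be followed
# by a closing tag with the same name.
PAIRED_RE = re.compile(r"<(\w+)[^>]*>.*?</\1>", re.DOTALL)

# Hypothetical per-tag pattern: any single opening or closing tag.
SINGLE_TAG_RE = re.compile(r"</?\w+[^>]*>")

# Malformed but still renderable HTML: the <li> items are never closed,
# and the closing </ul> has no opening counterpart.
html = "<li>one<li>two</ul>"

paired_match = PAIRED_RE.search(html)      # no tag has a matching pair
single_tags = SINGLE_TAG_RE.findall(html)  # every tag is still found
```

With this input the pair-matching pattern finds nothing, while the per-tag pattern still sees all three tags.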
This looks simpler to me, thanks! Still would like for @mishig25 to double-check since he originally wrote this part, to make sure we are not forgetting any other use-case.
Agreed! :) Other than huggingface/transformers#23646, everything seems to convert correctly. To be sure, we can perhaps run a diff between the two? |
Note that the tool is used across all Hugging Face repos, so there might be some issues specific to other repos too. |
@mishig25 @sgugger Now that huggingface/transformers#23646 and #374 are both merged, could we re-run the gh actions so we can preview the documentation? 😇 Once they pass, I will do some more tests for the other repos. |
Closes #23625 Testing huggingface/doc-builder#373
Failing tests related to #374 (which have been fixed) |
oh I see, @xenova could you update this PR branch with the |
Done 👍 . Looks like the docs built fine (https://moon-ci-docs.huggingface.co/docs/transformers/pr_23867) |
I imagine that since HF is a large org, GH probably provisions pretty decent hardware for actions, so the speedup isn't as dramatic as it is for Transformers.js (3000x speedup). But here it is for the Transformers: I'm not sure what the rest of the 8 min difference is, but the difference for building mdx files is pretty noticeable too. |
I came to this issue while trying to preview |
@mishig25 friendly ping. |
My docs builds for transformers.js are getting quite long (e.g., the latest one took ~21 mins), so it would be nice to get this merged 😅 @mishig25 I could switch to this branch in the meantime, but I think other repos will also benefit from this. |
We can address any of Mishig's comments in a followup PR if needed. Let's not block everyone as doc-building is quite long on all Transformers projects! |
Thanks so much for merging @sgugger! Just did a test now and it sped up the building for transformers.js by 7000x 🤯 (0.0355 it/s -> 246.61 it/s) and brought down the entire process from 21 mins to 4 mins (with 3 mins spent initializing containers) |
Went from 30 minutes to 12 minutes for Transformers :-) |
amazing @xenova super fast now!! |
Hey, I just found out about this change as it just broke
The new regex considers it as an HTML tag which ends up breaking the doc build as the @xenova Given how great the speed improvement of your change is, I'll try to update the regex but still keep your approach. I'll keep you updated :) EDIT: Opened a new PR to fix the regex: #394. |
Building/previewing the docs for Transformers.js takes >12 minutes on my system and ~15 minutes with GitHub actions:
After investigating further, it's due to this line of code, which consumes >99% of build time:
doc-builder/src/doc_builder/convert_md_to_mdx.py
Lines 85 to 87 in c59346e
The regex is very strict and requires that the start and end tags match. However, in practice, this is not necessary at all (and in fact, it will break on slightly malformed HTML tags). A better (and much faster) approach is to just replace the `<` character of any valid start/end tag (regardless of whether it has a matching pair). That is the purpose of this PR.
Before:
After:
which is a ~3000x speedup.
This fixes #331, which reported the same issue.
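For readers who want a feel for the approach, here is a minimal sketch of the idea described above. It is not the code from this PR: the pattern `TAG_RE`, the `PLACEHOLDER` marker, the function name `escape_stray_lt`, and the use of `&lt;` as the escape are all assumptions made for illustration.

```python
import re

# Hypothetical tag pattern (an assumption, not the regex from this PR):
# "<", an optional "/", a tag name starting with a letter, optional
# attributes, an optional trailing "/", then ">".
TAG_RE = re.compile(r"<(/?[A-Za-z][A-Za-z0-9-]*(?:\s[^<>]*)?/?)>")

# Assumed marker; it must not occur in the input text.
PLACEHOLDER = "LTHTML"


def escape_stray_lt(text: str) -> str:
    """Escape every "<" that does not start something shaped like a tag."""
    # 1. Protect the "<" of each start/end tag, whether or not it has a pair.
    text = TAG_RE.sub(lambda m: PLACEHOLDER + m.group(1) + ">", text)
    # 2. Any "<" still present is not part of a tag, so escape it.
    text = text.replace("<", "&lt;")
    # 3. Put the protected "<" characters back.
    return text.replace(PLACEHOLDER, "<")


mixed = escape_stray_lt("if a < b: <p>hi<br>")
unbalanced = escape_stray_lt("</span><b>bold")
```

Each tag is handled in a single pass, with no attempt to find its counterpart, so unbalanced input such as `</span><b>bold` passes through unchanged. One trade-off of a pattern this loose is that plain text shaped like a tag (for example `<y and z>`) is also treated as one.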