Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize html tag replace regex (Fixes #331) #373

Merged
merged 7 commits into from
Aug 22, 2023

Conversation

xenova
Copy link
Contributor

@xenova xenova commented May 20, 2023

Building/previewing the docs for Transformers.js takes >12 minutes on my system and ~15 minutes with GitHub actions:
image

After investigating further, its due to this line of code, which consumes >99% of build time:

_re_lt_html = re.compile(r"<(\S+)([^>]*>)(((?!</\1>).)*)<(/\1>)", re.DOTALL)
while _re_lt_html.search(text):
text = _re_lt_html.sub(r"LTHTML\1\2\3LTHTML\5", text)

The regex is very strict, and requires that the start and end tag match. However, in practice, this is not necessary at all (and in fact, will break on slightly malformed html tags). A better (and much faster approach) is to just replace the < character of any valid start/end tag (regardless of whether they have a matching pair). That is the purpose of this PR.


Before:

Building the MDX files: 100%|██████████| 30/30 [10:35<00:00, 22.71s/it]

After:

Building the MDX files: 100%|██████████| 30/30 [00:00<00:00, 124.46it/s]

which is a ~3000x speedup.


This fixes #331, which reported the same issue.

@xenova
Copy link
Contributor Author

xenova commented May 20, 2023

Although it passes the unit tests (except of course the timm docstring one which we know of), I think we should run before and after this change on the transformers docs and perform a diff to make sure nothing crazy happens.

@xenova
Copy link
Contributor Author

xenova commented May 20, 2023

Unrelated failing tests fixed here

@mishig25 mishig25 requested review from mishig25 and sgugger May 21, 2023 09:11
mishig25 added a commit to huggingface/transformers that referenced this pull request May 21, 2023
@xenova
Copy link
Contributor Author

xenova commented May 21, 2023

@mishig25 I see it fails on the case:

 122 |  <p>TP is almost always used within a single node. That is TP size <= gpus per node.</p>

Should be simple to fix. I'll also add this as a test case.

Mostly edge-cases

Note: HTML is quite forgiving and will still parse/render tags even when some tags are mismatched or missing.

The previous regex would barf on this.
Copy link
Contributor

@sgugger sgugger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks simpler to me, thanks! Still would like for @mishig25 to double-check since he originally wrote this part, to make sure we are not forgetting any other use-case.

@xenova
Copy link
Contributor Author

xenova commented May 22, 2023

This looks simpler to me, thanks! Still would like for @mishig25 to double-check since he originally wrote this part, to make sure we are not forgetting any other use-case.

Agreed! :) Other than huggingface/transformers#23646, everything seems to convert correctly. To be sure, we can perhaps run a diff between the two?

@sgugger
Copy link
Contributor

sgugger commented May 22, 2023

Note that the tool is used across all Hugging Face repos, so there might be some issues specific to other repos too.

@xenova
Copy link
Contributor Author

xenova commented May 22, 2023

Note that the tool is used across all Hugging Face repos, so there might be some issues specific to other repos too.

Right - it's probably best to do what @mishig25 did here for all the repos listed here.

@xenova
Copy link
Contributor Author

xenova commented May 29, 2023

@mishig25 @sgugger Now that huggingface/transformers#23646 and #374 are both merged, could we re-run the gh actions so we can preview the documentation? 😇

Once they pass, I will do some more tests for the other repos.

mishig25 added a commit to huggingface/transformers that referenced this pull request May 30, 2023
@xenova
Copy link
Contributor Author

xenova commented May 30, 2023

Failing tests related to #374 (which have been fixed)

@mishig25
Copy link
Contributor

oh I see, @xenova could you update this PR branch with the main branch ?

@xenova
Copy link
Contributor Author

xenova commented May 30, 2023

Done 👍 . Looks like the docs built fine (https://moon-ci-docs.huggingface.co/docs/transformers/pr_23867)

@xenova
Copy link
Contributor Author

xenova commented May 30, 2023

I imagine that since HF is a large org, GH probably provisions pretty decent hardware for actions, so the speedup isn't as dramatic as it is for Transformers.js (3000x speedup).

But here it is for the Transformers:

Before:
image

After:
image


I'm not sure what the rest of the 8 min difference is, but the difference for building mdx files is pretty noticeable too.

@radames
Copy link

radames commented Aug 8, 2023

I came to this issue while trying to preview xenova/transformers.js docs. Is there a workaround to speed up the preview using docs-preview ? right now it taking considering amount of time just to preview a couple of docs. thanks

@sgugger
Copy link
Contributor

sgugger commented Aug 8, 2023

@mishig25 friendly ping.

@xenova
Copy link
Contributor Author

xenova commented Aug 21, 2023

My docs on transformers.js builds are getting quite long (e.g., the latest one is ~21 mins), so it would be nice to get this merged 😅 @mishig25

I could switch to use this branch in the meantime, but I think other repos will also benefit from this.

@sgugger sgugger merged commit 2d60a3d into huggingface:main Aug 22, 2023
@sgugger
Copy link
Contributor

sgugger commented Aug 22, 2023

We can address any of Mishig's comments in a followup PR if needed. Let's not block everyone as doc-building is quite long on all Transformers projects!

@xenova
Copy link
Contributor Author

xenova commented Aug 22, 2023

Thanks so much for merging @sgugger! Just did a test now and it sped up the building for transformers.js by 7000x 🤯 (0.0355it/s-> 246.61it/s)

Before:

Building the MDX files: 100%|██████████| 34/34 [15:56<00:00, 28.14s/it]

After:

Building the MDX files: 100%|██████████| 34/34 [00:00<00:00, 246.61it/s]

and brought down the entire process from 21 mins to 4 mins (with 3 mins spent initializing containers)

@sgugger
Copy link
Contributor

sgugger commented Aug 22, 2023

Went from 30 minutes to 12 minutes for Transformers :-)

@radames
Copy link

radames commented Aug 22, 2023

amazing @xenova super fast now!!

@Wauplin
Copy link
Contributor

Wauplin commented Aug 29, 2023

Hey, I just found out about this change as it just broke huggingface_hub's CI. I have a docstring here in which both a "<" and a ">" are used but not as HTML tag:

...
... Small files (<5MB) are duplicated in ...
...
... README.md -> ../../blobs/d7edf6bd2a681fb0175f7735299831ee1b22b812 ....

The new regex considers it as an HTML tag which ends up breaking the doc build as the < char is not escaped properly.

@xenova Given how great the speed improvement of your change is, I'll try to update the regex but still keep your approach. I'll keep you updated :)

EDIT: Opened a new PR to fix the regex: #394.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

preview taking a lot of time
5 participants