
Want documentation (or even a feature?) to parallelize local scans #840

Open

jimklimov opened this issue May 9, 2024 · 1 comment

@jimklimov
I have a generated site with several thousand pages, and htmlproofer can take upwards of an hour of CPU churn to produce its verdicts, using one CPU core while the others sit idle.

From the documentation (README) I see mentions of further projects involved here, such as "nokogiri", "typhoeus" or "hydra" (names which say little to someone outside the Ruby ecosystem). Per https://github.com/gjtorikian/html-proofer#configuring-typhoeus-and-hydra (and https://github.com/typhoeus/typhoeus itself), it seems Typhoeus handles parallel queries to remote web servers, while Hydra exposes a max_concurrency setting.

I can actually request the latter via the CLI, but (as of htmlproofer 3.19.2 in Debian 12) this seems to have no effect: the system stays busy for over half an hour on one CPU core and has not even reported the number of pages it would parse:

```
:; time htmlproofer --disable-external --hydra-config='{"max_concurrency": 6}' ./networkupstools.github.io/
Running ["ImageCheck", "ScriptCheck", "LinkCheck"] on ["./networkupstools.github.io/"] on *.html...
```

Meanwhile, `top` shows a single busy core:

```
%Cpu(s): 13.0 us,  0.5 sy,  0.0 ni, 71.6 id,  1.7 wa,  0.0 hi,  0.5 si, 12.7 st
...
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 940002 abuild    20   0  882620 785876  10360 R 100.0   9.7  37:33.86 htmlproofer
```

(Note that the JSON here has to be strictly formatted, with quoted strings for object keys, not the bare tokens used in the Ruby examples of the README.)
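For comparison, here is a minimal sketch of the equivalent call through the Ruby API, assuming the `HTMLProofer.check_directory` entry point and the `:hydra` option as documented in the README (option names may differ across versions); since the options are a native Ruby hash, the JSON quoting rules do not apply there:

```ruby
require "html-proofer"

# Equivalent of the CLI run above, expressed as a plain Ruby hash.
HTMLProofer.check_directory(
  "./networkupstools.github.io/",
  disable_external: true,
  hydra: { max_concurrency: 6 }
).run
```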

I have only just started exploring this tool, so I have no idea whether it actually supports parallelized local scans.

I'd expect it to read a page, toss links to neighboring pages into a queue (unless those are already "known": queued, being handled, or fully processed), carry on analyzing the current page, then pick another page from the queue; rinse, repeat.

This would lend itself well to parallelization (with synchronized access to the queue singleton): many single-threaded runners each process one page, toss any still-"unknown" links into the queue, and return to pick up another page. A sketch of this scheme follows below.
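As an illustration only (this is not html-proofer code), a minimal Ruby sketch of such a synchronized work queue, with a hypothetical `scan_page` helper standing in for the real per-page checks:

```ruby
require "set"

# Illustration only (not html-proofer's actual internals): a synchronized
# work queue drained by N single-threaded workers, as described above.
WORKERS = 6
queue   = Queue.new   # thread-safe FIFO of pages still to scan
seen    = Set.new     # pages already queued, in flight, or done
seen_mu = Mutex.new   # guards `seen`

enqueue = lambda do |page|
  seen_mu.synchronize do
    next if seen.include?(page)
    seen.add(page)
    queue << page
  end
end

# Hypothetical stand-in for the real per-page checks; a real version
# would run the checks and return the internal links found on the page.
def scan_page(page)
  puts "checked #{page}"
  []
end

enqueue.call("index.html")  # seed with the starting page

threads = WORKERS.times.map do
  Thread.new do
    loop do
      page = begin
        queue.pop(true)     # non-blocking; raises ThreadError when empty
      rescue ThreadError
        break               # simplified termination: stop when queue drains
      end
      scan_page(page).each { |link| enqueue.call(link) }
    end
  end
end
threads.each(&:join)
```

One caveat: CRuby's Global VM Lock means plain threads only overlap I/O, not CPU-bound parsing, so saturating several cores with local Nokogiri work would likely need processes (or Ractors) rather than threads. That may also be why Hydra's max_concurrency shows no effect here: as far as I understand, it only parallelizes the external HTTP requests that --disable-external switches off anyway.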

Again, I have no idea whether something of the sort already exists and merely needs documentation, or whether it needs to be designed and implemented as well. Either way, cutting a (multi-)hour scan down to minutes would make the tool genuinely useful for regular CI sanity checks rather than one-off developer trials.

@jimklimov (Author)

UPDATE: Running a custom build of 5.0.9 (or rather the current GitHub master) I see the link and file counters appear much faster - kudos. Still, the hydra concurrency setting does not make use of more CPU cores. The new-version wording of the command is:

```
:; time htmlproofer --disable-external --hydra='{"max_concurrency": 6}' ./networkupstools.github.io/
```
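In case it helps anyone else hitting this, a rough workaround sketch (my own idea, not an html-proofer feature), assuming the v5 Ruby API: shard the top-level subdirectories across forked child processes, since separate processes sidestep the GVL. The obvious limitation is that each shard only sees its own files, so internal links pointing across shards would be misreported as broken:

```ruby
require "html-proofer"

NPROCS = 6
shards = Dir.glob("./networkupstools.github.io/*").select { |p| File.directory?(p) }
slice_size = [(shards.size.to_f / NPROCS).ceil, 1].max

shards.each_slice(slice_size) do |slice|
  fork do
    slice.each do |dir|
      # Per-shard run; internal links are resolved only within this shard.
      HTMLProofer.check_directory(dir, disable_external: true).run
    end
  end
end
Process.waitall  # collect all child processes before exiting
```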
