I have a generated site with several thousand pages, and htmlproofer can take upwards of an hour of CPU churn to come up with verdicts, using one CPU core while the others sit idle.

From the documentation (README) I see mentions of further projects involved in this one, such as "nokogiri", "typhoeus" or "hydra" (names that say little to someone outside the Ruby ecosystem). Per https://github.com/gjtorikian/html-proofer#configuring-typhoeus-and-hydra (and https://github.com/typhoeus/typhoeus itself), typhoeus deals with parallel queries to remote web servers, and its "hydra" part has a max_concurrency setting.

I can actually request the latter via the CLI, but (as of htmlproofer 3.19.2 in Debian 12) this seems to have no effect: the system stays busy for over half an hour on one CPU core and has not even reported how many pages it will parse:
:; time htmlproofer --disable-external --hydra-config='{"max_concurrency": 6}' ./networkupstools.github.io/
Running ["ImageCheck", "ScriptCheck", "LinkCheck"] on ["./networkupstools.github.io/"] on *.html...
### top:
%Cpu(s): 13.0 us, 0.5 sy, 0.0 ni, 71.6 id, 1.7 wa, 0.0 hi, 0.5 si, 12.7 st
...
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 940002 abuild    20   0  882620 785876  10360 R 100.0   9.7  37:33.86 htmlproofer
(note that the JSON here has to be strictly well-formed, with quoted strings for object keys -- not bare tokens as in the Ruby examples in the README).
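For what it's worth, judging from the typhoeus README that setting appears to govern only how many *queued HTTP requests* are in flight at once, which would explain why it does nothing for a --disable-external run over local files. A minimal, illustrative sketch of typhoeus usage (placeholder URLs, not related to html-proofer's internals):

```ruby
require "typhoeus"

# Minimal sketch of what typhoeus/hydra parallelizes: queued HTTP requests.
# max_concurrency caps how many of them are in flight at the same time.
# The URLs below are placeholders for illustration only.
hydra = Typhoeus::Hydra.new(max_concurrency: 6)

%w[https://example.com/a https://example.com/b https://example.com/c].each do |url|
  request = Typhoeus::Request.new(url, method: :head)
  request.on_complete { |response| puts "#{url} -> #{response.code}" }
  hydra.queue(request)
end

hydra.run  # blocks until all queued requests have completed
```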
I have just recently started exploring this tool, so I have no idea whether it actually supports parallelized local scans.
I'd expect it to read a page, toss links to neighboring pages into a queue (unless those are already "known" - queued, being handled, or fully processed), carry on analyzing the current page, then pick another page from the queue; rinse, repeat.

This would lend itself well to parallelization (with synchronized access to the queue singleton): many single-threaded runners, each processing one page, tossing its still-"unknown" links into the queue, and returning to pick up another page from it.
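For illustration only (this is not html-proofer code, and the file/selector handling is deliberately naive), a sketch of that scheme in Ruby could look like the following: pages are handled in parallel "waves", each wave's not-yet-seen local links forming the next wave, with access to the shared `seen` set synchronized:

```ruby
require "set"
require "nokogiri"

# Hypothetical sketch of the scheme described above -- not html-proofer's
# actual implementation. A pool of threads drains a queue of page paths;
# any newly discovered local .html links become the next wave of work.
N_WORKERS = 6
seen     = Set.new
mutex    = Mutex.new
frontier = ["./networkupstools.github.io/index.html"]
seen.merge(frontier)

until frontier.empty?
  queue = Queue.new
  frontier.each { |page| queue << page }
  next_frontier = []

  N_WORKERS.times.map do
    Thread.new do
      while (page = (queue.pop(true) rescue nil))
        doc = Nokogiri::HTML(File.read(page))
        # ... run the Image/Script/Link checks for `page` here ...
        doc.css("a[href]").each do |a|
          target = File.expand_path(a["href"].sub(/[#?].*/, ""), File.dirname(page))
          next unless File.file?(target) && target.end_with?(".html")
          mutex.synchronize do
            next_frontier << target if seen.add?(target)
          end
        end
      end
    end
  end.each(&:join)

  frontier = next_frontier
end
```

One caveat with plain threads: on CRuby the global VM lock limits how much CPU-bound Ruby code actually runs in parallel, so a real speedup for local parsing might need process-level workers (or an alternative runtime such as JRuby) rather than threads alone.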
Again - no idea whether something of the sort already exists and merely needs documenting, or whether it needs to be designed and implemented as well. In any case, cutting a (multi-)hour scan down to minutes would be very welcome, so the tool becomes practical for regular CI sanity checks rather than one-off developer runs.
UPDATE: Running a custom build of 5.0.9 (or rather the current GitHub master), I see the link and file counters appear much faster - kudos. Still, the hydra concurrency setting does not make use of more CPU cores. The new-version form of the command is:
:; time htmlproofer --disable-external --hydra='{"max_concurrency": 6}' ./networkupstools.github.io/
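For reference, the Ruby-API counterpart of that call, following the README's option examples (an untested sketch, not taken from an actual run), would be roughly:

```ruby
require "html-proofer"

# Roughly the Ruby-API equivalent of the CLI invocation above; the :hydra
# hash is passed through to Typhoeus::Hydra per the README's examples.
HTMLProofer.check_directory(
  "./networkupstools.github.io/",
  disable_external: true,
  hydra: { max_concurrency: 6 }
).run
```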